This document has been approved
for public release and sale; its
distribution is unlimited.


PROCESSING  DYNAMIC  IMAGE  SEQUENCES 
FROM  A  MOVING  SENSOR 


Daryl  T.  Lawton 


COINS  Technical  Report  84-05 


February  1984 


This  work  was  supported  in  part  by  the  Office  of  Naval  Research  under  grant 
number  N00014-75-C-0459  and  the  Advanced  Research  Projects  Agency  under  grant 
number  N00014-82-K-0464. 


Processing  Dynamic  Image  Sequences 
from  a  Moving  Sensor 


A  Dissertation  Presented 

By 

DARYL  TALIESEN  LAWTON 


Submitted  to  the  Graduate  School  of  the 
University  of  Massachusetts  in  partial  fulfillment 
of  the  requirements  for  the  degree  of 

DOCTOR  OF  PHILOSOPHY 
February  1984 

Department  of  Computer  and  Information  Science 



©  Daryl  Taliesen  Lawton 


All  Rights  Reserved 


This  research  was  supported  in  part  by: 

The  Office  of  Naval  Research 
Grant  Number  N00014-75-C-0459 

and 

The  Advanced  Research  Projects  Agency 
Grant  Number  N00014-82-K-0464 


For  my  Genetic  Buddies, 
Sam,  Sarah,  Fritzie,  and  Dennis 


ACKNOWLEDGMENTS 


I  have  many  people  to  thank,  in  ways  more  special  than  are  possible  here.  I 
owe a tremendous amount to my principal advisors, Ed Riseman and Al Hanson (or
Ednal  to  many  of  us)  for  several  types  of  support  and  even  more  types  of  patience. 
Ed  in  particular  lovingly  badgered  me  through  some  periods  of  intense  laziness 
and  anguish.  Without  him,  this  thesis,  and  much  else  besides,  would  not  have  been 
completed. I would also like to thank Nico Spinelli for many enjoyable and insightful
discussions, and Bob Huguenin for his cheerful enthusiasm and for putting up
with being referred to, somewhat ambiguously, as my outside member.

I  am  very  happy  and  proud  to  be  a  part  of  the  UMASS  VISIONS  and  MOTIONS 
groups  for  shared  experiences,  software,  and  a  wide  range  of  generally  excessive 
behavior.  I  would  particularly  like  to  thank  Joachim  Rieger,  Terry  Weymouth, 
Frank  Glazer,  Gilad  Adiv,  George  Reynolds,  P.  Anandan,  Janet  Turnbull,  Martha 
Steenstrup,  Ken  Overton,  Bert  Shaw,  Tom  Williams,  John  Prager,  Charles  Kohl, 
Ralf  Kohler,  Steve  Levitan,  Chip  Weems,  Steve  Epstein,  Kate  Greenspan,  and  the 
little  Tex  Master,  Lenny  Wesley. 

Dr.  Nelson  Corby  of  the  Machine  Intelligence  Laboratory  of  General  Electric 
in  Schenectady,  New  York  made  possible  some  of  the  industrial  image  sequences  I 
have  been  working  with. 


ABSTRACT 


Processing  Dynamic  Image  Sequences 
from  a  Moving  Sensor 

February,  1984 
Daryl  T.  Lawton 

B.S.,  University  of  California  at  Santa  Cruz 
M.S.,  Ph.D.,  University  of  Massachusetts  at  Amherst 
Directed  by:  Professor  Edward  M.  Riseman 


A fundamental problem in motion processing research has been the discrepancy
between the precision and reliability with which image displacements can be
determined and the sensitivity of inference procedures to noise and resolution errors.
There are also indications that these inference procedures are inherently unstable
and, in some cases, ambiguous. The approach of this thesis has been to deal with
restricted cases of motion for which the inference of the motion parameters, image
displacements, and environmental depth, can be combined into a single, uniform,
and mutually constraining computation. These restricted cases of motion are
sufficient for a wide range of real-world tasks, especially since other associated sensing
devices can be used to ascertain the other parameters of motion. We then apply the
procedure developed for translational motion to local portions of image sequences
to process general sensor motion as if it were composed of independent local
environmental translations. The resulting representation can considerably simplify the
processing of less restricted and general motion.

The procedure for processing translational motion robustly combines the
determination of image displacements with the extraction of the direction of sensor
motion. We present several experiments showing its behavior in a variety of
situations. We also consider various extensions to this procedure for such things as
developing  it  as  a  hierarchical  computation;  processing  translational  blur  patterns; 
dealing  with  multiple  independently  moving  objects;  and  using  the  translational 
procedure  in  the  control  of  an  autonomous  vehicle. 

Results  are  presented  for  two  other  restricted  cases  of  motion:  pure  sensor 
rotation  and  motion  constrained  to  a  known  plane.  The  results  are  similar  to  the 
translational  case  except  that  certain  simple  cases  of  planar  motion  are  found  to  be 
inherently  ambiguous. 

We then process less restricted and general sensor motion by applying the
procedure for translational motion processing to local areas of images. This results in a
low level description of motion called the Environmental Direction of Motion Field
(or EDMF) which associates a direction of environmental motion with extracted
image features. This representation can greatly simplify the recovery of sensor
motion parameters. We also develop the constraints associated with object rigidity in
determining the inference of sensor motion parameters, and then show how these
constraints are simplified by information in the EDMF.

We  conclude  with  a  summary  of  the  major  results  of  the  thesis  and  mention 
future  work,  chiefly  in  the  areas  of  architectures  for  real  time  motion  processing, 
and  applications  to  more  challenging  and  specific  domains. 

INTRODUCTION


The  importance  of  processing  dynamic  information  is  obvious.  Change  is  a  basic 
and  pervasive  aspect  of  reality.  Artificial  perceptual  systems  which  cannot  deal 
with  such  dynamic  information  will  be  severely  limited.  They  would  not  be  able  to 
determine  basic  causal  and  structural  relations  in  the  environment.  They  would  not 
be  able  to  move  about  and  directly  explore  the  world.  These  fundamental  concerns, 
coupled  with  recent  advances  in  sensor  technology  and  attainable  computing  power, 
have  made  image  motion  processing  an  active  area  of  research. 

The  work  in  dynamic  image  processing  can  be  roughly  divided  into  two  types 
of techniques: those for determining the changes in a sequence of images and those
for  inferring  environmental  information  from  these  transformations.  Much  basic 
work  has  been  done  on  determining  the  displacements  of  distinguishable  image 
points  over  time  and  inferring  sensor  motion  and  environmental  depth  from  these 
displacements.  A  fundamental  problem  that  has  emerged  in  all  this  work  is  the 
discrepancy  between  the  precision  and  reliability  with  which  image  displacements 
can  be  determined  and  the  sensitivity  of  the  inference  procedures  to  noise  and 
resolution  errors.  For  example,  some  of  the  inference  procedures  require  high  order 
derivatives  to  be  extracted  from  the  determined  image  displacements.  Additionally, 
there  are  indications  that  the  problem  itself  is  inherently  unstable  and,  in  some 
cases, ambiguous. This has led to an interesting state of affairs: formulations which
are  often  elegant  but  do  not  work  in  motion  processing  of  real  world  situations,  and 
therefore  have  limited  practical  application. 


The  approach  of  this  thesis  has  been  to  deal  with  restricted  cases  of  motion  for 
which  the  inference  of  the  motion  parameters,  image  displacements,  and,  to  some 
extent,  environmental  depth,  can  be  combined  into  a  single,  uniform,  and  mutually 
constraining  computation.  These  restricted  cases  of  motion  are  sufficient  for  a  wide 
range  of  real-world  tasks,  especially  since  other  associated  sensing  devices  can  be 
used  to  ascertain  the  other  parameters  of  motion.  Finally,  we  apply  the  procedure 
developed  for  translational  motion  to  local  portions  of  image  sequences  to  process 
general  sensor  motion  as  if  it  were  composed  of  independent  local  environmental 
translations.  The  resulting  representation  can  considerably  simplify  the  processing 
of  less  restricted  and  general  motion.  A  brief  outline  of  the  thesis  follows. 


Thesis  Outline 

Chapters  two  and  three  present  background  information  on  motion  processing. 
In  chapter  two  we  review  the  general  problems  and  previous  work  in  image  motion 
processing.  In  chapter  three  we  review  the  basic  structural  relations  between  image 
displacements  and  sensor  motion. 

In chapter four we present a procedure for processing image sequences
produced by translational motion of a sensor relative to a stationary environment. The
procedure  robustly  combines  the  determination  of  image  displacements  with  the 
extraction  of  the  direction  of  sensor  motion.  Several  experiments  are  performed 
to  show  the  behavior  of  the  procedure  in  different  situations.  As  a  part  of  the 
implementation  we  develop  a  simple  feature  extraction  process. 


In  chapter  five  we  consider  various  extensions  to  the  translational  procedure. 
These  include  developing  the  procedure  as  a  hierarchical  computation  to  increase 
its speed; processing the blur patterns produced by prolonged exposures during
translational  motion;  dealing  with  multiple  independently  moving  objects;  and  using 
the  translational  procedure  in  the  control  of  an  autonomous  vehicle  by  using  devices 
to  stabilize  the  sensor  or  directly  determine  the  other  parameters  of  motion. 

In  chapter  six  we  consider  two  other  restricted  cases  of  motion:  pure  sensor 
rotation  and  motion  constrained  to  a  known  plane.  The  results  are  very  similar  to 
the  translational  case  except  that  certain  simple  cases  of  planar  motion  are  found 
to  be  inherently  ambiguous. 

In chapter seven we process less restricted and general sensor motion by
applying the procedure for translational motion processing to local areas of images. This
results in a low level description of motion called the Environmental Direction of
Motion Field (or EDMF) which associates a direction of environmental motion with
extracted image features. This representation can greatly simplify the recovery of
sensor motion parameters. We consider different ways of computing the EDMF and
how sensor motion can be determined from it. We present a simple computation
for the case of motion constrained to an unknown plane. We also develop the
constraints associated with object rigidity in determining the inference of sensor motion
parameters, and then show how these constraints are simplified by information in
the EDMF.

In  chapter  eight  we  summarize  the  major  results  of  the  thesis  and  mention 
future  work,  chiefly  in  the  areas  of  architectures  for  real  time  motion  processing, 
and  application  to  more  challenging  and  specific  domains. 


CHAPTER II


THE  NATURE  OF  MOTION  PROCESSING 


Introduction 


A  general  outline  of  motion  processing  is  shown  in  figure  1.  This  figure  indicates 
a  basic  control  loop  in  which  the  changes  in  a  sequence  of  images  are  determined  and 
represented,  a  model  is  inferred  from  these  transformations,  and  the  model  is  used  to 
predict  and  constrain  the  processing  of  further  and  ongoing  image  transformations. 


Figure 1. The General Structure of Motion Processing


Each of these elements (the image transformations, the inference of the model,
the model itself, and the predictions) typically corresponds to several different
processes and representations which can vary significantly with application. In
this  representation,  the  beginning  of  the  processing  is  ambiguous  because  of  the 
circuit  nature  of  the  organization.  This  is  an  aspect  of  what  we  will  refer  to  as  the 
start-up  problem,  and  is  concerned  with  whether  it  is  possible  to  determine  image 
transformations  without  an  initial  model.  Generally,  there  is  always  an  initial 
model  which  is  either  based  upon  domain  specific  information  about  the  type  of 
image  transformations  that  can  be  expected  to  occur,  or  implicit  in  the  procedures 
for  determining  image  transformations  by  basing  them  upon  general  environmental 
properties  such  as  continuity  of  motion  and  environmental  surfaces. 

One implication of the start-up problem is that motion processing always
involves assumptions about the environment in which it is used. In many applications,
these assumptions are quite specific and task dependent, as in target tracking. In
others, the assumptions are more abstract and the resulting procedures have more
general application, as in the case of constrained types of continuous motion,
constrained types of environmental objects, or image transformations. A general area
of research in motion processing has been concerned with the analysis of image
sequences produced by rigid body motions in the environment. This problem lends
itself to a theoretical development which does not become overly complex, yet also
reflects a very common occurrence in the real world. A particular image
transformation which this analysis can utilize is also well known: optic flow. This may
be thought of as an almost classical problem in image processing: the inference of
environmental information from the optic flow field generated by rigid body
motions. In much of what follows, the static environment is viewed as a single rigid
body and relative motion is induced by sensor motion.

Optic Flow

Optic  flow  is  the  vector  field  representing  the  changes  in  the  positions  of  the 
images  of  environmental  points  over  time.  It  was  introduced  by  the  psychologist 
J.J. Gibson [Gibs50, Gibs66, Gibs79] based, to some extent, on his experiences as a
bomber  pilot  during  the  Second  World  War.  Gibson  was  struck  with  how  different 
patterns  and  extents  of  image  displacements  could  specify  critical  environmental 
information  for  the  control  of  behavior,  such  as  heading,  immediacy  of  collisions, 
and  environmental  layout.  Gibson’s  analysis  has  proven  to  be  extremely  suggestive 
and  stimulating,  but  incomplete,  in  two  critical  aspects.  He  assumed  the  optic  flow 
field  was  a  given  and  did  not  deal  with  the  computational  difficulties  in  determining 
it.  He  also  did  not  explicitly  (at  least  initially  and  never  completely)  analyze  how 
environmental  information  was  extracted  from  the  flow  field.  Both  of  these  problems 
have  come  to  form  the  basis  of  much  research  by  psychologists,  psychophysicists,  and 
researchers  in  computer  vision.  It  is  this  latter  work,  concerning  the  computation 
of  optic  flow  and  the  formation  of  environmental  inferences  from  optic  flow,  upon 
which  we  will  focus. 

There  is  some  ambiguity  in  the  definition  of  optic  flow  in  the  literature  (even 
with  respect  to  the  phrase  itself,  since  optical  flow  or  even  optic  flows  are  used). 
Some  refer  to  the  flow  field  as  being  entirely  independent  of  images,  and  instead 
view  it  as  a  representation  of  the  changes  in  environmental  directions  over  time. 
To others it is a basic description of image motion determined from image
intensity changes and not necessarily related to environmental motions. Both of these
perspectives  have  validity  and  the  sense  to  which  we  are  referring  should  be  clear 
from  the  context  of  whether  we  are  dealing  with  computing  optic  flow  or  forming 
environmental  inferences  from  a  flow  field.  A  further  source  of  ambiguity  is  that 
some people refer to the optic flow field as a continuous vector field in which the
vectors are instantaneous velocity vectors, while others refer to it as a field of
discrete displacement vectors. Throughout this thesis, we refer to it as a set of discrete
displacement vectors.


Computing  Optic  Flow 

Computing  optic  flow  involves  the  determination  of  the  displacements  of  image 
points  over  a  sequence  of  images.  There  are  several  problems  in  this  computation 
involving  the  effects  of  image  resolution,  the  types  of  dramatic  changes  in  image 
structure  that  can  occur  during  motion  (such  as  occlusion),  and  the  now  well-known 
stimulus matching or correspondence problem. To begin with, the notion of an
environmental point corresponding to a distinguishable image point is an abstraction
which is difficult to realize computationally. An image point is actually a small
image area which can correspond to an appreciable surface area in the environment.
One  aspect  of  this  observation  is  that  actual  flow  fields  do  not  have  an  arbitrarily 
high  level  of  precision.  The  flow  vector  at  a  point  may  actually  summarize  the 
composite  activities  of  an  area  in  the  environment.  Another  implication  is  the 
emergence  or  disappearance  of  detail  as  environmental  surfaces  are  approached  or 
receded  from.  In  such  situations,  features  which  are  meaningful  and  trackable  at 
one  environmental  distance  may  no  longer  be  meaningful  at  another  distance.  This 
provides  motivation  for  the  hierarchical  procedures  for  flow  field  computation  that 
we discuss below. It also reflects an important assumption applied throughout
motion processing: during motion the image structures will change sufficiently slowly
to  allow  the  changes  to  be  determined,  but  not  so  dramatically  that  correspondence 
becomes  unrecognizable  at  successive  instants.  Often  this  is  not  a  valid  assumption 
and  reflects  another  basic  problem  with  computing  optic  flow.  Highly  significant 
information  can  be  obtained  from  particular  situations  at  which  the  optic  flow  field 
becomes non-existent or singular, and thus difficult to compute. These situations
are related to image events such as occlusion, the motion of specularities, and
the presence of smooth extremal boundaries. Another source of confusing changes
is the wide range of general noise effects in image formation.


Figure  2.  The  Stimulus  Matching  Problem 


The stimulus matching or correspondence [Burt76, Huan81, Thom81, Ullm81]
problem refers to the ambiguity in determining image displacements, and is
particularly problematic with nondistinctive portions of image structures or homogeneous
image areas. The difficulties are simply exemplified by the situation illustrated in
figure 2 which shows a square undergoing a diagonal displacement. The
information obtainable at a portion of one of the edges only constrains the locally observed
edge motion to a wide range of potential displacements. The general form of the
stimulus  matching  problem  involves  the  manner  in  which  local  determination  of 
displacements  can  result  in  a  globally  coherent  interpretation  of  the  changes  in  an 
image  sequence. 
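
The local ambiguity in the square example can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions: the function name `normal_component` and the sample displacements are hypothetical, and the edge is idealized as an infinitely long straight line seen through a small window. Under that idealization, only the displacement component normal to the edge is locally observable, so quite different motions look identical.

```python
import math

def normal_component(displacement, edge_direction):
    """Project a 2-D displacement onto the unit normal of a straight edge.

    Through a small window on the edge, sliding along the edge produces no
    visible change, so this projection is all that can be measured locally.
    """
    ex, ey = edge_direction
    length = math.hypot(ex, ey)
    nx, ny = -ey / length, ex / length   # unit normal to the edge
    dx, dy = displacement
    return dx * nx + dy * ny

# Hypothetical values: the top edge of the square runs along (1, 0).
horizontal_edge = (1.0, 0.0)
true_motion = (3.0, 3.0)     # the diagonal displacement of the square
other_motion = (-5.0, 3.0)   # a very different motion

observed_true = normal_component(true_motion, horizontal_edge)
observed_other = normal_component(other_motion, horizontal_edge)
print(observed_true, observed_other)  # both motions give the same local measurement
```

A measurement on a second, differently oriented edge (for instance a vertical side of the square) supplies the missing component, which is one way to read the requirement that local determinations be combined into a globally coherent interpretation.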

Techniques  developed  to  date  for  computing  optic  flow  can  be  grouped  into 
matching  techniques  and  differential  techniques.  Both  of  these  techniques  have  to 
deal with the problems just described and are distinguished by the different
assumptions under which they operate. Both can be expressed hierarchically (though it is
more  typical  for  matching  procedures).  This  allows  the  procedures  to  be  expressed 
uniformly  across  different  image  resolutions,  and  a  flow  field  to  be  determined  by 
utilizing  required  consistencies  between  image  displacements  in  images  at  different 
resolutions. 


Matching  Techniques 

Matching techniques can be roughly distinguished by the types of image
structures upon which they operate and the criteria by which matches of image structures
in successive images are determined. Image structures can be ordered by the
extent and the locality of processing required in their extraction and the complexity
of the structural relations in their description. In general, the more abstract the
image structure, the more stable it becomes over a sequence of images because the
ambiguity in determining matches is reduced. For example, if a complete semantic
analysis of each image has been performed in a sequence taken from a sensor
moving relative to a house, it is easier to match at the level of extracted houses in
the successive images than at a less abstract and more local feature level, such as a
vertical edge. There are fewer things to match and they cover an area of the image
significantly larger than their potential displacements.

Examples of image structures that have been (or could be) used in motion
analysis, organized in terms of increasing abstraction, are distinctive raw image
subareas [Agga81b, Barn80, Dres81, Hann74, Levi73, Mora81, Quam71], parameterized
tokens describing local image subareas [Hara82, Hara83, Lee82, Prag79], edges
[Agga81a, Burr77, Mart79], regions [Medi83, Nage77, Nage78, Radi81, Roac79],
structural descriptions of edges and regions [Brad83, Jaco80], instantiated
environmental surfaces [Will80], and various high level semantic interpretations [Badl75,
Tsot80].

Procedures for determining optic flow have generally been restricted to
matching features whose extraction involves very little processing and which are based on local
image structures and computations. This is a consequence of optic flow being viewed
as a very primitive description of image motion from which much information that is
useful for higher level processes will be derived. From this perspective, flow
processing should not be dependent on the processes to which its results will contribute.
Also, when more abstract descriptions are used, although the determination of
matches becomes more viable, the determination of specific image displacements
becomes less exact. This reflects a general problem that has been largely ignored by
researchers in motion (with some important exceptions, notably Tsotsos [Tsot80]):
the  mechanisms  by  which  matches  at  different  semantic  levels  of  image  descriptors 
can  be  combined  into  a  coherent  interpretation  of  an  image  sequence.  Here,  the 
matches  between  lower  level  image  structures  could  be  constrained  by  the  matches 
determined  at  higher  levels  of  surface  or  semantic  description.  The  same  question 
is  involved  in  prediction  of  feature  displacements  from  a  model  in  which  the  model 
may  consist  of  relatively  distinct,  multilevel  information,  and  is  used  to  constrain 
the  interpretation  and  displacements  of  low  level,  local  processes  and  features. 


In  general,  most  matching  procedures  that  have  been  developed  do  not  explicitly 
deal  with  the  dramatic-change  and  resolution  problems.  Due  to  the  assumption 
that  most  image  structures  will  change  slowly  over  time,  if  dramatic  changes  do 
occur,  they  will  be  reflected  by  a  break-down  in  the  matching  processes.  The 
basic  approach  to  the  stimulus  matching  problem  has  been  to  characterize  global 
properties  of  the  displacement  field  in  a  manner  which  directs  the  evaluation  of 
image  displacements.  This  is  done  in  different  ways.  Matching  structures  at  a 
more  abstract  or  symbolic  level  typically  involves  matching  strings  or  graph-like 
structures.  There  are  solutions  to  this  type  of  problem  using  dynamic  programming 
or  heuristic  search  techniques  to  minimize  some  global  distortion  measure  reflecting 
the  extent  of  graph  similarity  [Barr72,  Chen82,  Hara78,  Shap82].  In  another  form 
of  match  processing  typically  applied  to  less  abstract  features,  a  global  property 
such  as  smoothness  or  continuity  of  the  displacement  field  is  used  to  form  a  local 
constraint  on  the  flow  field  computation.  This  constraint  leads  to  a  local,  iterative, 
relaxation  type  procedure  in  which  a  given  feature  displacement  must  be  consistent, 
under the criterion of smoothness, with the displacements of its spatially neighboring
features [Barn80, Prag79]. Updating rules take the form of setting a feature's
estimate of its correct displacement to the average of its neighbors.
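
The neighbor-averaging update can be sketched in a few lines. This is a minimal illustration, not the procedure of [Barn80] or [Prag79]: the function `relax`, the explicit neighbor lists, and in particular the idea of holding some distinctive features fixed ("anchored") are assumptions made so that the example has something to propagate.

```python
def relax(displacements, neighbors, anchored, iterations=50):
    """Iteratively smooth 2-D displacement estimates.

    displacements: list of (dx, dy) initial estimates, one per feature
    neighbors:     list of neighbor index lists, one per feature
    anchored:      set of feature indices whose estimates are held fixed
                   (an assumption of this sketch, not part of the source)
    """
    current = list(displacements)
    for _ in range(iterations):
        updated = []
        for i, nbrs in enumerate(neighbors):
            if i in anchored or not nbrs:
                updated.append(current[i])
                continue
            # Updating rule: replace the estimate by the neighbors' average.
            mean_dx = sum(current[j][0] for j in nbrs) / len(nbrs)
            mean_dy = sum(current[j][1] for j in nbrs) / len(nbrs)
            updated.append((mean_dx, mean_dy))
        current = updated
    return current

# Hypothetical chain of five features. The two end features are distinctive
# and trusted; the three interior ones start with no reliable estimate.
initial = [(2.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]
smoothed = relax(initial, chain, anchored={0, 4}, iterations=200)
print(smoothed)
```

In this toy case the interior estimates are pulled toward the displacement of the anchored features, which is the sense in which a global smoothness property acts through a purely local rule.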

Generalized Hough transform approaches to matching [Agga81b, Ball81,
O'Rou81, Davi83] somewhat reverse the relation between local computations and
global field properties when compared to the relaxation-based matching approaches
just described. In the generalized Hough approaches, the properties of a
displacement field are parameterized and represented in an n-dimensional histogram to
which the local image measurements contribute. For example, the global structure
of the flow field can be restricted to being a particular type of transformation, such
as an affine transformation in the plane. Each local process for determining an
image displacement evaluates the consistency of its potential displacements with
the values of the parameters describing each affine transformation (up to some level
of parametric resolution). Globally, the parameter value most consistent with all
of the potential image displacements will have the most favorable evaluation (or
response in the histogram). Once a global interpretation has been determined, it can
then be refined with increased resolution in the parameter space about the coarse
solution.
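
The voting scheme can be illustrated with a deliberately reduced case. The sketch below assumes the global transformation is a pure image translation, so the parameter space is only two-dimensional; a full affine transformation would need a six-dimensional histogram. The function `hough_translation`, the one-vote-per-bin rule, and the sample candidates are all hypothetical choices for this example.

```python
from collections import Counter

def hough_translation(candidates_per_feature, bin_size=1.0):
    """Vote candidate displacements into a 2-D histogram and return the peak.

    candidates_per_feature: one list per feature of candidate (dx, dy)
    displacements that the local matcher could not tell apart.
    Returns the winning translation and the number of supporting features.
    """
    accumulator = Counter()
    for candidates in candidates_per_feature:
        # Each feature votes at most once per bin, so a feature with many
        # candidates cannot dominate the histogram.
        bins = {(round(dx / bin_size), round(dy / bin_size))
                for dx, dy in candidates}
        for b in bins:
            accumulator[b] += 1
    (bx, by), votes = accumulator.most_common(1)[0]
    return (bx * bin_size, by * bin_size), votes

candidates = [
    [(2.0, 1.0), (5.0, 1.0)],    # ambiguous feature
    [(2.0, 1.0), (2.0, -3.0)],   # ambiguous feature
    [(-1.0, 4.0), (2.0, 1.0)],   # ambiguous feature
    [(0.0, 0.0)],                # a confident but wrong local match
]
best, support = hough_translation(candidates)
print(best, support)
```

The refinement step mentioned above would correspond, in this sketch, to calling the function again with a smaller `bin_size` on only those candidates that fall near the coarse peak.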


Differential  Techniques 

Differential techniques are based on direct measurements of intensity changes
along the image gradient (that is, perpendicular to the local edge) in order to determine
one component of the optic flow at a point. These measurements are expressed as a
function of the temporal
changes in image intensity and the image gradient at a point. The other component
is then determined by using an additional constraint derived from assumptions
concerning the global structure of the flow field. These generally involve smoothness
of the flow field or the type of transformations that can describe the displacement
field. In a manner similar to the matching techniques, these constraints can be
developed computationally as local, iterative processes in which global consistency is
achieved via propagation similar to solutions of diffusion equations [Horn80, Glaz81,
Glaz83a, Terz83]. In a few applications [Fenn79, Thom81], the local measurements
can also be integrated by their independent contributions to a global histogram
which expresses the parameter values of particular types of image transformations.
Differential techniques can also be used to roughly constrain the motion of
boundaries [Marr79] without trying to derive the optic flow. These constraints can be
used to get rough qualitative motion information along closed contours, such as
expansion, image motion in a rough direction, or the occurrence of rotation.
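
As a rough illustration of the local measurement, the sketch below uses the usual brightness-constancy form of the constraint, Ix*u + Iy*v + It = 0, which is not written out in the text and is assumed here in the style of [Horn80]. From it, only the flow component in the direction of the intensity gradient can be recovered at a single point. The function `normal_flow`, the finite-difference scheme, and the ramp test pattern are illustrative choices.

```python
def normal_flow(frame0, frame1, x, y):
    """Estimate the flow component along the intensity gradient at (x, y).

    Assumes the constraint Ix*u + Iy*v + It = 0, with central differences
    in space and a forward difference in time. Frames are lists of rows.
    Returns (u, v) of the gradient-direction component, or None where the
    gradient vanishes and no constraint is available.
    """
    ix = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0
    iy = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0
    it = frame1[y][x] - frame0[y][x]
    grad_sq = ix * ix + iy * iy
    if grad_sq == 0.0:
        return None              # homogeneous area: nothing can be measured
    scale = -it / grad_sq
    return (scale * ix, scale * iy)

# Hypothetical test pattern: a horizontal intensity ramp I = 10 * x,
# displaced one pixel to the right between the two frames.
frame0 = [[10.0 * x for x in range(5)] for _ in range(5)]
frame1 = [[10.0 * (x - 1) for x in range(5)] for _ in range(5)]
flow = normal_flow(frame0, frame1, 2, 2)
print(flow)
```

Because the ramp does not vary vertically, any additional vertical displacement would leave both frames, and hence the measurement, unchanged. That missing component is what the additional global constraint has to supply.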


The key attribute of differential techniques is that they are based on very
local, simple computations that may be performed at a low level of processing.
They are also based on some unrealistic assumptions that show up when these
techniques are uniformly applied to actual image sequences. These assumptions
concern smoothness and often linearity in the image intensity gradients, limited
extents of motion, and the constancy of image brightness over time. The smoothness
assumption  breaks  down  at  surface  occlusion  boundaries,  or  wherever  dramatic 
image  changes  occur  such  as  at  reflectance  boundaries.  Differential  techniques 
also  tend  to  produce  dense  fields,  whose  value  is  not  clear,  especially  since  the 
interpolation  is  performed  in  a  manner  that  may  adversely  affect  the  inference  of 
motion parameters. Researchers are focusing on some of these problems: Schunck
[Schu83] has tried to characterize the effects of occlusion so that the computation of
image displacements is selectively shut off in such areas. Nagel [Nage83], Hildreth
[Hild82],  and  Kearney  [Kear82]  are  working  with  more  complex  image  gradients  and 
integrating  the  components  of  information  to  the  degree  they  provide  unambiguous 
displacement  information  at  boundaries. 

Hierarchical  Processing 

A  basic  paradigm  in  computer  vision  is  the  use  of  hierarchical  representations 
and  processes  [Burt82,  Hans80,  Rose83,  Tani80,  Uhr78].  This  allows  different 
resolutions  and  scales  of  image  events  to  be  handled  uniformly.  Additionally,  the 
consistent agreement among hierarchically organized processes is a basic control
strategy for a wide range of high and low level interpretation tasks. Hierarchical
processing can produce significant computational reductions, wherein results from
processing performed rapidly at lower resolutions of image information are used to
direct and constrain more detailed and extensive processing of higher resolution
image  information.  Given  the  increase  in  computational  requirements  over  static 
image  processing,  hierarchical  mechanisms  are  extremely  important  in  real-time 
motion  processing. 

The use of hierarchical processing in motion typically involves representing an
image at different filtered spatial frequencies and using the processing at lower
spatial frequencies to constrain the processing at higher spatial frequencies [Burt82,
Glaz83b, Grim81, Luca81, Wong78]. The matches determined for the larger spatial
structures in an image are used to initialize the computation for the displacements
of the smaller structures. In hierarchically organized processing, the resolution
problem is handled implicitly by representing an image sequence at multiple
resolutions simultaneously. The stimulus matching problem is dealt with by taking
advantage of the fact that matches have a tendency to be less ambiguous at lower
spatial frequencies because there are fewer gross image structures and they are large
relative to their potential displacements. However, the problems of dramatic change
associated with flow field computation affect hierarchical processing because image
structures may appear and disappear at different levels of resolution and errors
produced at lower image resolutions can propagate to the higher resolution images.
Some filtering schemes [Burt83, Glaz83b] have been proposed to deal with this
inherent problem by detecting the occurrence of a failure in the matching procedure
and shutting off the initialization of image displacements in the higher resolution
images.


Inference  of  Environmental  Information 


Work in the inference of environmental information from flow fields has
generally been restricted to the case of rigid body motion or linked systems of rigid
bodies [Webb81]. There is very little general understanding in the interpretation
of non-rigid environmental motions. Often, such work is task dependent as in the
interpretation of image sequences of moving cloud formations and beating hearts
[Tsot80].

The  problem  of  inferring  environmental  information  from  a  flow  field  produced 
by rigid body motion is often termed the shape-from-motion problem (i.e., how
to  determine  the  shape  of  objects  or  environmental  depth  from  a  flow  field  or  a 
sequence  of  flow  fields);  or,  somewhat  confusingly,  the  motion-from-motion  problem 
(i.e.,  how  to  determine  the  parameters  of  object  or  sensor  motion  from  a  flow  field 
or  sequence  of  flow  fields).  Theoretically,  these  problems  are  equivalent,  though 
there  are  practical  difficulties  in  inferring  one  from  the  other. 

There have been significant milestones in formulating solutions to these
problems in motion processing research. One set of results has dealt with the minimal
conditions that are necessary for determining object shape and sensor motion in
terms of the number of flow vectors across an image sequence [Fang83b, Lawt80,
Meir80, Roac80, Ullm79, Webb81, Yen83]. In this work, researchers derive
various sets of simultaneous nonlinear equations whose solution would constitute the
appropriate inference. Since these equations cannot be solved directly, various
optimization procedures are required. In another set of formulations developed
primarily by Nagel [Nage81] and Prazdny [Praz81], the inference of sensor motion
parameters is expressed as a search through the rotational subspace of the total set
of rigid body motion parameters. Prazdny's development is rather geometrical and
Nagel's is more algebraic, but they are basically similar. In 1981, Tsai and Huang
[Tsai82], simultaneously with Longuet-Higgins [Long81], developed a closed form
solution which could be solved by direct means.

Given these developments over the past several years, it is somewhat alarming
that none of the techniques have been successfully applied to flow fields computed
from anything like realistic image sequences. In fact, only in the recent work of
Huang and Fang [Fang83a, Fang83b] and Jerian and Jain [Jeri83] has there even
been an explicit evaluation of a procedure on such images. This work has shown the
particular difficulties familiar to motion researchers: extreme sensitivity to noise and
resolution, dependence upon the type and extent of motion, and general instability.

A possible exception to these difficulties may be a procedure recently developed
by Rieger and Lawton [Rieg83, Lawt83]. The technique is restricted to recovering
the parameters of sensor motion relative to a stationary environment and is
based upon the fact that the decomposition of a flow field into its rotational and
translational components can be directly obtained at image positions where a
significant depth variation occurs in the environment [Long80], such as at some occlusion
boundaries. This results in a very simple analysis which does not involve solving
unstable equations. The basic practical difficulty associated with this technique is
that it is dependent on the analysis of a flow field at occlusion boundaries where the
flow field tends to be most error-prone. Dealing with this effect requires a computation
which may reduce the precision of the inference of the sensor motion parameters.

There are many reasons, not all of which are fully understood, why the
inference of motion parameters and environmental depth has been difficult. Some of the
formulations involve image measurements, such as higher order derivatives of an
instantaneous vector velocity field, which are difficult to obtain and are also quite
noise sensitive when applied to discrete image sequences [Praz80, Long80]. There
are also many cases of motion which are inherently ambiguous. One of these is
discussed in chapter VI of this thesis and concerns a rather typical case of terrestrial
motion in which the rotational and translational field components are nearly
impossible to separate. In recent work concerning the interpretation of images containing
multiple independently moving objects, Adiv [Adiv84] appears to have found cases
in which independently moving objects with different parameters of motion can,
when considered together, result in a globally consistent, but incorrect,
interpretation. Another problem affecting shape from motion formulations is the baseline
effect which is common to stereo. The baseline effect expresses that the resolution
and accuracy of depth inferences decrease as the distance between the sensor
locations at which images are formed decreases. For motion, where the sensor
displacements are generally small between successive instants, the environmental
inference would tend to be poor, but this could be compensated by the availability of
more and more images over time.

There has been almost no stability analysis of the systems of equations for
inference from optic flow. Along these lines, recent work by colleagues and myself
[Stee83] has given empirical indications of the instabilities in the inference
procedures under certain conditions. We have been exploring the use of a highly parallel
array architecture for inferring motion parameters from flow fields. This processing
amounts to sampling and evaluating 200,000 points in the five dimensional space
of determinable rigid body motion parameters at near video rates. This roughly
shows the appearance of the error surface these systems of equations may describe.
What this work indicates is that the space is very bumpy and jagged and full of local
optima, which would make solutions difficult, especially in the presence of noise.

There have been several responses to these difficulties. One approach has been
to utilize optimization procedures which are based on global evaluation of the
expressions for the inference of motion parameters from flow fields instead of local,
iterative optimization procedures. Examples of these approaches are the work with
generalized Hough transforms [Adiv84, Ball81] and the procedure involving highly
parallel architectures mentioned above [Stee83]. Some researchers are beginning to
perform an explicit analysis of the stability of the different solutions [Shaw83], while
others are trying to develop qualitative inference techniques which are hoped to be
more robust [Thom83], and still others are beginning to investigate the inference
of motion and shape from image transformations other than optic flow, such as
the analysis of contour shape changes [Davi82]. Currently, much of this work is
preliminary.

Another response to these inadequacies has been to deal with restricted cases
of motion. Here too, the work has been limited in application to realistic image
sequences, with principal results having been achieved by Williams [Will80] and
Dreschler and Nagel [Dres81]. These restricted cases of motion can be of
significant practical use, since in many cases some of the parameters of motion can be
determined by other sensing devices. Additionally, general motion can be locally
interpreted, temporally and spatially, as consisting of certain restricted types of
motion.

In the research presented in this thesis, we will develop procedures for various
cases of restricted motion, and show how to use the procedures for translational
motion to locally interpret more general motion. In this regard, it is useful to
summarize related work in vanishing point extraction and translational motion
processing. The determination of the vanishing point in a static image is closely related
to determining the direction of translation. In perspective projection, parallel lines
in the environment map onto lines radiating from the vanishing point in the image.
For translational motion, the environmental motion paths correspond to the
parallel lines in the perspective case. Techniques for extraction of a vanishing point
have been explored by Kender [Kend79], Nakatani [Naka80], and in a more general
framework by Ballard [Ball81]. The use of the Hough transform in this work is
similar to the global sampling of the error measure developed in chapter IV. It would
be interesting if the determination of edges could be combined with the
determination of the vanishing point, in a manner similar to the concurrent determination
of image displacements and the translational axis in the work presented in chapter
IV.


Williams  [Will80]  was  the  first  to  develop  algorithms  for  interpreting  natural 
complex  images  produced  by  an  optic  sensor  translating  relative  to  environmental 
objects.  This  work  consisted  of  two  processes:  one  for  inferring  the  direction  of 
translation  given  environmental  depth  information  and  the  other  for  inferring  depth 
given  the  direction  of  motion.  These  processes  used  an  error  measure  describing  the 
consistency  of  depth  information  and  the  inferences  of  feature  motion  along  image 
displacement  paths.  His  work  indicated  that  these  two  processes,  for  inferring  depth 
and  the  direction  of  motion,  could  be  combined. 

The  primary  weakness  of  Williams’  work  was  the  necessary  restriction  to  planar 
surfaces  at  one  demonstrated  orientation.  Additionally,  in  the  case  of  unknown 
environmental  depth  and  translation,  the  processing  is  quite  complex  —  involving 
segmentation,  resegmentation,  and  coordinating  the  processes  for  inferring  depth 
and  for  inferring  the  direction  of  translation.  The  method  we  develop  in  chapter 
IV  requires  no  restrictions  on  the  orientation  of  surfaces  or  shape  of  environmental 
objects,  and  involves  only  a  simple  procedure  for  evaluating  an  error  measure.  It 
also  indicates  that  the  direction  of  sensor  motion  should  be  determined  prior  to,  or 
concurrently  with,  environmental  depth. 


CHAPTER III


DISPLACEMENT  FIELD  STRUCTURE 

Introduction 

In this chapter we review the relations between sensor motion relative to rigid
body objects and the structure of the corresponding field of image displacements.
Basic results from kinematics [Whit44] and geometry [Coxe61] allow arbitrary rigid
body motions of the camera to be decomposed into a rotation about its focal point
followed by a translation. This permits image motions to be described as consisting
of two components: a rotational and a translational field. The rotational field
contains information concerning sensor orientation relative to the environment, while
the translational component contains information concerning environmental depth
and the relative displacements of the sensor and environmental objects. This
decomposition forms the basis of procedures for recovering camera motion parameters
from displacement fields [Nage81, Praz81].

Describing  Rigid  Body  Motion 


In  this  section  we  review  some  basic  terminology  for  describing  image  and  envi¬ 
ronmental  motion,  the  particular  coordinate  systems  employed,  and  how  rigid  body 
motions  are  described  in  terms  of  sensor  motion. 




Terminology 

It is necessary to have terms for describing the motion of features in an
image sequence and the corresponding motion of environmental points. We define an
Image Displacement Vector to be a two-dimensional vector describing the
displacement of an image feature from one image to the next. An Image Displacement
Field is the set of image feature displacement vectors for successive images. An
Image Displacement Sequence indicates the positions of an image feature over
several successive images. Though we are dealing with discrete image sequences, it is
often possible to describe the continuous curve along which an image feature point
is moving. This curve is called the Image Displacement Path.

Corresponding to image motions we use a set of terms for describing
environmental motions. An Environmental Displacement Field is the set of three-dimensional
vectors indicating the positions of environmental points at successive instants. An
Environmental Displacement Sequence indicates the position of an environmental
point over several successive instants. An Environmental Displacement Path
describes the three-dimensional curve that an environmental point is moving along
for a particular motion.


Coordinate  Systems 

We utilize two coordinate systems in this exposition: a fixed system based on the
environment and another based on the sensor. The fixed environmental coordinate
system is a Cartesian coordinate system. The sensor coordinate system (or camera
model) is referred to throughout this thesis and consists of a planar retina embedded
in a three-dimensional Cartesian coordinate system (X, Y, Z), with the origin at
the focal point and the optical axis aligned with the positive Z axis (figure 3). The
X and Y axes correspond to the gravitationally intuitive horizontal and vertical
directions, respectively. The image plane is parallel to the XY plane and located
at a distance of one focal length along the Z axis.


Figure  3.  Camera  Model. 


Positions in the image plane are described using a 2-D coordinate system with
the axes A and B aligned with the X and Y axes of the camera coordinate
system, respectively. The origin of the image plane coordinate system is determined
by the intersection of the image plane and the Z axis. The vector $P_{mi}$ refers to
the position of an environmental point in the sensor coordinate system and the
vector $I_{mi}$ refers to the position of the intersection of the ray of projection for
$P_{mi}$ with the image plane. The first index of these vectors is used to specify a
particular image from a sequence of images. The second index specifies a particular
environmental point. Setting the focal length to one, the relations between $P_{mi}$,
$I_{mi}$, and positions in the image plane determined by perspective projection are:

\[
P_{mi} = (x_{mi},\, y_{mi},\, z_{mi}), \qquad
I_{mi} = \left( \frac{x_{mi}}{z_{mi}},\, \frac{y_{mi}}{z_{mi}} \right)
\tag{1}
\]


The position and the orientation of the sensor relative to the environmental
coordinate system at time t is described by the vector P(t) and the matrix O(t),
where P(t) is the position of the origin of the sensor coordinate system at time t,
and O(t) describes the orientation of the sensor coordinate system by its direction
cosines. The matrix O(t) is obtained by translating the sensor coordinate system
to the origin of the environmental coordinate system and determining the angles
between the axes of the two coordinate systems. Denoting the coordinate axes
of the camera coordinate system as (X_c, Y_c, Z_c) and those of the environmental
coordinate system as (X, Y, Z) yields:

\[
O(t) =
\begin{pmatrix}
\cos(X, X_c) & \cos(X, Y_c) & \cos(X, Z_c) \\
\cos(Y, X_c) & \cos(Y, Y_c) & \cos(Y, Z_c) \\
\cos(Z, X_c) & \cos(Z, Y_c) & \cos(Z, Z_c)
\end{pmatrix}
\tag{2}
\]

Decomposing  Rigid  Body  Motion 


There  are  some  basic  results  in  kinematics  which  allow  arbitrary  rigid  body 
motions  to  be  expressed  as  consisting  of  a  rotation  about  an  axis  positioned  at  an 
arbitrary  point  followed  by  a  translation.  These  are  stated  as 


A  rotation  about  any  axis  is  equivalent  to  a  rotation  through  the 
same  angle  about  any  axis  parallel  to  it,  together  with  a  simple 
translation  in  a  direction  perpendicular  to  the  axis.  The  converse 
is  also  true,  the  rotation  of  a  rigid  body  about  any  axis,  preceded 
or  followed  by  a  translation  in  a  direction  perpendicular  to  the  axis, 
are  together  equivalent  to  a  rotation  of  the  body  about  a  parallel 
axis  [Whit44]. 


Thus, the orientation of a body changes in the same way for parallel axes of rotation
with the same extent of rotation, regardless of where they are positioned. This
implies that the axis of rotation can be positioned anywhere so long as the rotation
is followed by the appropriate translation. Thus, we can canonically describe sensor
motion as an initial rotation about an axis positioned at the origin of the sensor
coordinate system (bringing the sensor into the same orientation at successive
instants) followed by a translation (bringing the sensor into coincidence at the
successive instants). This will also decompose an image displacement field into a field
produced solely by the rotation of the sensor and a field produced solely by the
translation of the sensor. Each of these fields contains different information.

More specifically, given the sensor at successive positions and orientations (P(t),
O(t)) and (P(t + 1), O(t + 1)), its motion is described as an initial rotation about
the origin of the sensor coordinate system described by the matrix R such that
O(t + 1) = O(t) × R, followed by a translation T with respect to the environmental
coordinate system such that P(t + 1) = P(t) × T. Thus, with the position P(t)
written as a homogeneous translation matrix,

\[
\begin{aligned}
O(t)^{-1} \times O(t+1) &= R \\
P(t)^{-1} \times P(t+1) &= T,
\qquad
P(t) =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
P_x(t) & P_y(t) & P_z(t) & 1
\end{pmatrix}
\end{aligned}
\tag{3}
\]
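The decomposition just described can be illustrated with a minimal sketch. It assumes orientations are stored as 3x3 rotation matrices (so the inverse is the transpose) and that positions are plain 3-vectors, so the translation is returned as a difference vector rather than as a homogeneous matrix. The function names (`relative_motion`, `rot_x`, `rot_z`) are hypothetical and chosen only for this illustration.

```python
import math

def rot_x(angle):
    """3x3 rotation matrix about the X axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_z(angle):
    """3x3 rotation matrix about the Z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    """Transpose of a 3x3 matrix; equals the inverse for a rotation matrix."""
    return [list(row) for row in zip(*m)]

def relative_motion(o_t, p_t, o_t1, p_t1):
    """Return (R, T) such that O(t+1) = O(t) R and P(t+1) = P(t) + T."""
    r = matmul(transpose(o_t), o_t1)            # R = O(t)^-1 O(t+1)
    t = [b - a for a, b in zip(p_t, p_t1)]      # translation as a 3-vector
    return r, t
```

For example, composing `rot_z(0.3)` with a further `rot_x(0.1)` and calling `relative_motion` on the two orientations returns the `rot_x(0.1)` matrix again, up to rounding.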


Properties  of  Pure  Rotational  Displacement  Fields 

Let us consider rotational fields that are produced by rotation about an axis
containing the origin of the sensor coordinate system. The basic property of such
fields is that the image displacements are totally a function of image position and can
yield no information concerning environmental depth. That is, given the position
of an image point at time t and the sensor rotation R, its position at time t + 1
is determined without reference to the depth of the corresponding environmental
point.
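The depth independence of rotational displacements can be checked numerically. The sketch below is an illustration only: it assumes a focal length of one, rotates the environmental point about an axis through the focal point using Rodrigues' formula (a sensor rotation corresponds to the inverse rotation of the point, which does not affect the conclusion), and uses hypothetical function names.

```python
import math

def rotate_about_axis(p, axis, angle):
    """Rotate 3-D point p about a unit axis through the origin (Rodrigues' formula)."""
    ax, ay, az = axis
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * x + ay * y + az * z
    cross = (ay * z - az * y, az * x - ax * z, ax * y - ay * x)  # axis x p
    return tuple(c * v + s * cr + (1.0 - c) * dot * a
                 for v, cr, a in zip(p, cross, axis))

def project(p):
    """Perspective projection onto the image plane z = 1."""
    x, y, z = p
    return (x / z, y / z)

def rotated_image_position(image_point, axis, angle, depth=1.0):
    """Image position after rotation, for a point at the given depth on the ray."""
    a, b = image_point
    p = (a * depth, b * depth, depth)   # environmental point on the ray of projection
    return project(rotate_about_axis(p, axis, angle))
```

Calling `rotated_image_position` with the same image point and rotation but different `depth` values gives the same result up to rounding, which is the property stated in the text.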


To  describe  the  general  structure  of  rotational  flow  fields,  consider  the  image 
displacement  path  generated  by  a  particular  image  point  under  sensor  rotation.  In 
figure  4a  we  see  an  axis  of  rotation  positioned  at  the  origin  of  the  coordinate  system 
and a ray of projection determined by some image point $I_{mi}$. The effect of the
rotation  will  be  that  the  ray  of  projection  will  generate  the  surface  of  a  cone.  The 
image  displacement  path  for  the  rotation  of  this  image  point  will  then  be  determined 
by  the  intersection  of  this  cone  with  the  image  surface,  i.e.  a  conic  section. 


Figure 4a. Rotational Displacement Paths. The figure on the left shows
the intersection of an image plane with the cone determined by the axis of
rotation positioned at the focal point and a given image position vector.
The figure on the right shows the resulting conic image displacement path.

One should note that for points along the same ray of projection, the image
displacements under a given rotation will all be the same. Thus, there is no basis upon
which to infer environmental depth under rotational motion because the angles
between rays of projection remain fixed.

Now let us consider sensor rotation analytically with the axis of rotation
represented as a unit vector R = (R_x, R_y, R_z). For any environmental point P =
(x, y, z), we can describe the cone generated by the rotation to be:

\[
c = \cos(\theta) = \frac{P \cdot R}{\lVert P \rVert}
\tag{4}
\]

where $\theta$ is the angle between R and P. To determine the image displacement
paths, we expand this equation with z set to 1 (corresponding to the location of
the image plane):

\[
c = \frac{x R_x + y R_y + R_z}{\sqrt{x^2 + y^2 + 1}}
\tag{5}
\]

By  squaring  both  sides  and  reorganizing  terms,  this  equation  may  be  expressed  as 
an  implicit  function  in  the  general  form  of  a  conic: 


\[
\begin{aligned}
F(x, y) ={}& x^2 (R_x^2 - c^2) + y^2 (R_y^2 - c^2) + 2x (R_x R_z) \\
&+ 2y (R_y R_z) + 2xy (R_x R_y) + (R_z^2 - c^2) = 0
\end{aligned}
\tag{6}
\]


The  partial  derivatives  of  this  equation  yield  the  tangents  to  the  image  displacement 
path: 


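\[
\begin{aligned}
\frac{\partial F(x, y)}{\partial x} &= 2x (R_x^2 - c^2) + 2 (R_x R_z) + 2y (R_x R_y) \\
\frac{\partial F(x, y)}{\partial y} &= 2y (R_y^2 - c^2) + 2 (R_y R_z) + 2x (R_x R_y)
\end{aligned}
\tag{7}
\]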


Note that for the rotational axis aligned with the Z axis, R = (0, 0, 1), substitution
into equation 6 yields

\[
x^2 + y^2 = \frac{1}{c^2} - 1
\tag{8}
\]

This describes a family of circles in the image plane centered at (0, 0, 1) and indexed
by the particular values of c in the range 0 to 1 (figure 4b). For the rotational axis
R = (0, 1, 0), substitution into equation 6 yields

\[
y^2 \, \frac{1 - c^2}{c^2} - x^2 = 1
\tag{9}
\]


This  describes  a  family  of  hyperbolas  indexed  by  values  of  c  in  the  range  0  to  1 
(figure  4c). 
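As a hedged numerical check of the conic form of the rotational displacement paths, the sketch below builds the six conic coefficients from a unit rotation axis and a cone constant c, and classifies the path by the sign of the discriminant B^2 - 4AC. The function names and the tolerance are assumptions made for this illustration, and degenerate conics are not handled.

```python
import math

def conic_coefficients(axis, c):
    """Coefficients (A, B, C, D, E, F) of A x^2 + B xy + C y^2 + D x + E y + F = 0
    for the displacement path of a unit rotation axis and cone constant c."""
    rx, ry, rz = axis
    return (rx * rx - c * c, 2.0 * rx * ry, ry * ry - c * c,
            2.0 * rx * rz, 2.0 * ry * rz, rz * rz - c * c)

def evaluate_conic(axis, c, x, y):
    """Value of the implicit conic function at image position (x, y)."""
    a, b, cc, d, e, f = conic_coefficients(axis, c)
    return a * x * x + b * x * y + cc * y * y + d * x + e * y + f

def conic_type(axis, c, tol=1e-12):
    """Classify the path by the discriminant B^2 - 4AC of the quadratic part."""
    a, b, cc, _, _, _ = conic_coefficients(axis, c)
    disc = b * b - 4.0 * a * cc
    if disc < -tol:
        return "ellipse"      # includes the circles for rotation about the Z axis
    if disc > tol:
        return "hyperbola"
    return "parabola"
```

With c = 0.8, the axis (0, 0, 1) is classified as an ellipse (the circular case) and the axis (0, 1, 0) as a hyperbola, in agreement with the two special cases discussed above.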


For  purely  translational  motion  the  sensor  orientation  is  fixed  relative  to  the 
environmental coordinate system and the motion is described by an axis of
translation. The image displacement paths are determined by the intersection of the
translational  axis  with  the  image  plane.  If  the  translational  axis  intersects  the 
image  plane  on  the  positive  half  of  the  axis,  the  point  of  intersection  is  called  a 
Focus  of  Expansion  (FOE)  and  the  image  motion  is  along  straight  lines  radiating 
from  it.  This  corresponds  to  sensor  motion  towards  visible  environmental  points. 
If  the  translational  axis  intersects  the  image  plane  on  the  negative  half  of  the  axis, 
the  point  is  called  a  Focus  of  Contraction  (FOC)  and  the  image  displacement  paths 
are  along  straight  lines  converging  towards  the  FOC.  This  corresponds  to  camera 
motion  away  from  visible  environmental  points.  The  intersections  of  axes  parallel 
to  the  image  plane  are  points  at  infinity  and  thus  may  be  considered  to  be  either 
an  FOE  or  FOC  in  opposite  directions.  This  ambiguity  is  one  reason  we  refer  to 
the  directions  of  motion  determined  by  the  translational  axes  themselves  instead  of 
the  intersections  with  the  image  plane. 

Given the direction of translation and the image displacements of a set of
environmental points, the relative depths of these points can be computed by solving
the inverse perspective transform [Roge76]. Relative depth can also be simply
inferred from the position of a feature and the extent of its displacement relative to
an FOE or an FOC. This relation is expressed as

\[
\frac{Z}{\Delta Z} = \frac{D}{\Delta D}
\]

where Z is the value of the Z component of an environmental point at time t + 1,
ΔZ is the extent of environmental displacement along the Z axis from time t to
time t + 1, D is the distance of the corresponding image point from the FOE or
FOC at time t, and ΔD is the displacement of the image point from time t to time
t + 1. Thus, the Z value of an environmental point can be recovered from image
measurements in units of ΔZ, or what has been termed Time-Until-Contact by
Lee [Lee76, Lee80] (figure 5a and 5b). To the degree that the sensor displacement
can be accurately monitored, absolute depth of surface points can be computed.
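The relation between relative depth and displacement from the FOE can be sketched in a few lines. This is an illustration under stated assumptions: a focal length of one, a stationary environment, a feature that actually moves between the two images (so the displacement is nonzero), and hypothetical function names.

```python
import math

def project(p):
    """Perspective projection onto the image plane z = 1."""
    x, y, z = p
    return (x / z, y / z)

def time_until_contact(foe, pos_t, pos_t1):
    """Return D / dD, the depth at time t + 1 in units of the sensor's Z displacement.

    foe    : image position of the focus of expansion (or contraction)
    pos_t  : image position of the feature at time t
    pos_t1 : image position of the same feature at time t + 1
    """
    d = math.hypot(pos_t[0] - foe[0], pos_t[1] - foe[1])               # distance from FOE
    delta_d = math.hypot(pos_t1[0] - pos_t[0], pos_t1[1] - pos_t[1])   # image displacement
    return d / delta_d
```

For a point at (2, 1, 10) and a sensor step of (0.2, 0, 1), the FOE is at (0.2, 0) and the point lies at depth 9 after the step; the function returns 9, i.e. nine further steps of the same size until the sensor reaches the depth of the point.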


Figure 5a. Relation between relative environmental depth and the
extent of image displacement with respect to the FOE/C.

The effects of composite image motions produced by sensor rotation and
translation can be analyzed as follows for an image feature $I_{mi}$ which undergoes a
displacement D to position $I_{ni}$ at time n (figure 6a). The motion can be
described as an initial displacement R to a position $J_{mi}$ due solely to the rotation
of the sensor, which is followed by a displacement T from $J_{mi}$ to $I_{ni}$ along the
translational displacement path determined by the straight line containing the image
point $J_{mi}$ and the FOE determined by the translational parameters.




Figure  6a.  Composite  Field  Structure. 


Figure  6b.  Error  Measure  from  Composite  Field  Structure 

These structural properties will be used to develop measures describing the
consistency of a given image displacement with hypothesized sensor rotation and
translation parameters (figure 6b). As above, for an image point $I_{mi}$, the rotational
parameters induce an image displacement to some position $J_{mi}$. This point and
the FOE corresponding to a particular translational axis determine an expected
translational displacement path. The angle between this displacement path and the
vector $I_{ni} - J_{mi}$ corresponds to the discrepancy between the image displacement
and the hypothesized values of the sensor motion parameters. We will utilize this
measure to evaluate motion parameters with respect to determined displacement
fields in chapters VI and VII. This local consistency measure has also been used in
generalized Hough transforms so that each image displacement vector can scale its
vote against a particular set of motion parameters corresponding to the extent of
this determined angle [Stee83].
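A minimal sketch of this angular consistency measure follows. It assumes 2-D image positions given as tuples, treats the expected displacement path as an undirected line through the rotated position and the FOE (so the same function covers an FOE or an FOC), and uses the hypothetical name `path_discrepancy`.

```python
import math

def path_discrepancy(j_mi, i_ni, foe):
    """Angle in radians, in [0, pi/2], between the residual displacement
    I_ni - J_mi and the straight line through J_mi and the hypothesized FOE."""
    ex, ey = j_mi[0] - foe[0], j_mi[1] - foe[1]     # direction of the expected path
    ox, oy = i_ni[0] - j_mi[0], i_ni[1] - j_mi[1]   # observed residual displacement
    cross = ex * oy - ey * ox
    dot = ex * ox + ey * oy
    # abs(dot) folds opposite directions together, giving a line-to-vector angle.
    return math.atan2(abs(cross), abs(dot))
```

A residual displacement lying exactly on the expected path gives 0, and one perpendicular to it gives pi/2; summing or voting with this value over all features is one way such a measure could be used to score a hypothesized set of motion parameters.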


CHAPTER  IV 


PROCESSING  TRANSLATIONAL  MOTION 


Introduction 

In  this  chapter  we  present  a  procedure  for  processing  image  sequences  produced 
by  translational  motion.  The  computation  robustly  combines  the  determination 
of  the  translational  motion  parameters,  image  displacements,  and  environmental 
depths of visible surfaces. The procedure consists of two basic steps: Feature
Extraction and Search. The feature extraction process finds small image areas which
may  correspond  to  distinguishing,  and  therefore  trackable,  parts  of  environmental 
objects.  The  direction  of  translational  motion  is  then  found  by  a  search  across 
hypothesized  FOE/C  positions  to  determine  a  set  of  image  displacement  paths  for 
the  extracted  features  which  minimizes  an  error  measure  of  total  feature  mismatch 
along  these  displacement  paths,  and  also  yields  consistent  displacements  for  the 
features. 

The  feature  extraction  process  finds  distinctive  points  which  are  positioned  at 
points  of  high  curvature  along  contours  determined  by  simple  processes  such  as 
thresholding,  zero-crossing  extraction  and  local  contrast  measurements.  Particular 
forms of the feature extraction process can lead to effective and very rapid
computation on proposed image processing architectures.

The  search  process  minimizes  an  error  measure  defined  over  a  unit  sphere,  with 
each  point  on  the  sphere  corresponding  to  a  different  direction  of  sensor  translation. 
A given direction of translation constrains the motion of extracted image features
to straight lines which radiate from or converge onto a single point in the image
plane.  Thus,  the  error  measure  associates  a  point  on  the  unit  sphere,  corresponding 
to  a  particular  translational  axis,  with  a  number  describing  the  degree  of  total 
feature  mismatch  along  the  displacement  paths  determined  by  the  translational 
axis.  Experiments  have  shown  this  error  measure  to  be  smooth  and  with  a  distinct 
minimum  in  a  large  neighborhood  about  the  correct  translational  axis.  This  allows 
simple  search  methods  to  be  effective. 

We  present  several  experiments  showing  the  results  of  applying  the  procedure 
in  various  situations.  The  experiments  indicate  that  it  is  robust  and  applicable  to  a 
wide  range  of  real  world  image  sequences.  In  the  next  chapter,  we  review  particular 
extensions for implementing the procedure in a hierarchical computational framework,
dealing with independently translating objects, translational blur-streaks,
and  implications  for  autonomous  navigation. 


The  feature  extraction  process  is  used  to  determine  small  areas  (referred  to  as 
image points or features) in an image that are distinct from their respective neighboring
areas. This distinctiveness limits the potential matches of these image areas
in succeeding images and suggests the possibility that these points may be trackable
over  time.  These  image  features  may  also  reflect  a  correspondence  to  actual  and 
significant  features  in  the  environment,  such  as  points  of  high  curvature  on  object 
boundaries, texture elements, surface markings, etc. However, there are some features,
termed false features, which may be selected but which result from noise,
occlusion,  and  light  source  effects  and  have  behavior  which  is  currently  difficult  to 
interpret.  Features  can  be  represented  either  as  arrays  of  numbers  extracted  as  a 
subimage  directly  from  an  image,  or  as  parameterized  tokens  describing  local  image 
properties.  We  refer  to  features  exclusively  as  small  arrays  of  data  values  centered 
at  some  point  in  an  image  at  some  time  t . 

Following  Moravec  [Mora77,  Mora81],  the  method  of  feature  extraction  used 
here  is  based  upon  finding  image  areas  which  are  significantly  different  than  their 
neighboring  areas.  Using  correlation  measures  bounded  between  1  (for  perfect 
correlation)  and  0 ,  the  distinctiveness  of  a  feature  is  1  minus  the  best  correlation 
value  obtained  when  the  feature  is  correlated  with  its  immediately  neighboring  areas 
(excluding  correlation  with  itself).  Good  features  can  then  be  selected  by  finding 
the  local  maxima  in  the  values  of  the  distinctiveness  measure  over  an  image.  There 
are several metrics available for similarity of two n x n arrays $A_{i,j}$ and $B_{i,j}$. We
have utilized the following measures:

Normalized Correlation


$$1 - \frac{\sum_i \sum_j |A_{i,j} - B_{i,j}|}{\sum_i \sum_j A_{i,j} + \sum_i \sum_j B_{i,j}}$$

All  of  these  measures  have  a  value  of  1  for  a  perfect  match.  Of  these,  the  first 
choice  is  the  most  conventional,  the  second  is  a  good  approximation  to  the  first  and 
more  efficient,  and  the  third  is  the  quickest  to  evaluate. 
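
The distinctiveness computation described above can be sketched in a few lines of Python. This is an illustration only, not the implementation used for the experiments. The three metric forms are assumptions: the standard normalized correlation, the pseudo-normalized correlation usually attributed to Moravec, and a sum-normalized absolute difference. The function names, the list-of-lists image layout, and the use of the 8 immediate neighbors are likewise hypothetical choices.

```python
import math

def normalized_correlation(a, b):
    """Standard normalized correlation of two equal-sized 2-D lists."""
    num = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = math.sqrt(sum(x * x for r in a for x in r) *
                    sum(y * y for r in b for y in r))
    return num / den if den else 0.0

def moravec_norm(a, b):
    """Assumed pseudo-normalized form: 2*sum(AB) / (sum(A^2) + sum(B^2))."""
    num = 2.0 * sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = sum(x * x for r in a for x in r) + sum(y * y for r in b for y in r)
    return num / den if den else 0.0

def abs_value_norm(a, b):
    """Assumed form for non-negative data: 1 - sum|A-B| / (sum(A) + sum(B))."""
    num = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = sum(map(sum, a)) + sum(map(sum, b))
    return 1.0 - num / den if den else 1.0

def patch(image, r, c, half):
    """Square array of side 2*half+1 centered at (r, c); caller keeps it in bounds."""
    return [row[c - half:c + half + 1] for row in image[r - half:r + half + 1]]

def distinctiveness(image, r, c, half=1, metric=moravec_norm):
    """1 minus the best match of the feature with its 8 neighboring features."""
    center = patch(image, r, c, half)
    best = max(metric(center, patch(image, r + dr, c + dc, half))
               for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0))
    return 1.0 - best
```

A feature in a uniform region matches its neighbors perfectly and so has distinctiveness 0, while an isolated bright spot matches its shifted neighbors poorly and scores close to 1. Candidate features would then be taken at local maxima of this value.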

We  further  constrain  the  neighborhoods  over  which  the  features  are  selected 
to  contours  determined  by  other  processes,  such  as  zero-crossing  extraction  and 
thresholding,  which  are  sensitive  to  edges.  This  yields  interesting  points  which  are 
locally  distinctive  and  exhibit  high  curvature  along  extracted  contours  containing 
the  point. 




Feature  Extraction  Using  Zero-Crossings 

The  use  of  zero-crossings  to  determine  significant  image  contours  at  different 
levels of resolution has been proposed and extensively studied by Marr et al.
[Hild80, Marr80]. In this processing an image is convolved with Gaussian-Laplacian
masks ($\nabla^2 G$) of different positive widths and thresholded at zero to determine
zero-crossing  contours.  These  contours  are  significant  since  they  correspond  to  the 
points  of  greatest  change  in  the  convolved  image.  The  distinctiveness  measure  can 
be  applied  to  points  along  these  contours  in  the  convolved  image,  with  the  local 
maxima  determining  the  position  of  potential  features.  This  generally  has  the  effect 
of  finding  points  of  high  curvature  along  the  zero-crossing  contour,  although  points 
apparently  corresponding  to  local  occlusion  vertices  and  weak  maxima  will  also  be 
extracted. 

Many  weak  features  which  are  local  maxima  of  distinctiveness  can  be  removed 
by  suppressing  those  which  are  at  points  of  low  curvature  along  the  zero-crossing 
contours  (a  cheaper  method  for  dealing  with  this  is  presented  in  the  discussion  of 
this  chapter).  For  features  which  are  local  distinctiveness  maxima,  we  approximate 
the  curvature  along  the  contour  by  the  inner  product  of  the  normalized  vectors 
describing  the  relative  positions  of  the  nearest  local  maxima  along  the  contour 
(figure  7).  These  values  are  then  thresholded  between  1.0  (corresponding  to  high 
curvature)  and  -1.0  (corresponding  to  low  curvature)  to  reflect  feature  strength. 
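
The curvature approximation can be illustrated with a short Python sketch. It assumes the local distinctiveness maxima are already ordered along an open contour as (x, y) pairs; the function names and the strict comparison against the threshold are hypothetical choices, not details taken from the text.

```python
import math

def curvature_score(prev_pt, pt, next_pt):
    """Inner product of the unit vectors from pt to its two neighboring maxima.

    Returns +1 for a sharp fold (high curvature) and -1 for a straight contour.
    """
    ux, uy = prev_pt[0] - pt[0], prev_pt[1] - pt[1]
    vx, vy = next_pt[0] - pt[0], next_pt[1] - pt[1]
    nu, nv = math.hypot(ux, uy), math.hypot(vx, vy)
    if nu == 0 or nv == 0:
        return -1.0  # coincident neighbors: treat as no curvature
    return (ux * vx + uy * vy) / (nu * nv)

def suppress_low_curvature(maxima, threshold=-0.8):
    """Keep interior maxima whose curvature score exceeds the threshold."""
    kept = []
    for i in range(1, len(maxima) - 1):
        if curvature_score(maxima[i - 1], maxima[i], maxima[i + 1]) > threshold:
            kept.append(maxima[i])
    return kept
```

Since the score is the cosine of the angle between the two vectors, a threshold of -0.8 corresponds to an opening angle of arccos(-0.8), roughly 143 degrees; points where the contour is straighter than that are discarded.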


Figure  7.  Computation  of  curvature  for  low  curvature  suppression  of 
extracted  features. 

The images in figure 8a and figure 8b were taken from a gyroscopically stabilized
movie camera held by a passenger in a car traveling down a country road in
Massachusetts [Will80]. They are 128x128 pixel images with 6 bits of resolution
in intensity and will be referred to as the roadsign images. Figure 8c shows the
zero-crossings extracted from the initial roadsign image using a $\nabla^2 G$ mask with a
positive width of 5 pixels. The distinctiveness values were computed using features
which were 5x5 pixel arrays extracted from the convolved image and centered on
pixels which were adjacent to the zero-crossing contour and of positive value. These
features were correlated, using Moravec’s norm, with their 8 immediately neighboring
features. Figure 8d shows the local maxima in the distinctiveness measure
positioned with respect to the zero-crossing contour. Figure 8e shows the results of
suppressing low-curvature points using a threshold of -0.8 on the inner product
(corresponding to an angle of about 143 degrees).

Use of features based on zero-crossings requires specification of the sizes of the
convolution  masks  that  are  employed,  and  a  decision  whether  to  position  extracted 
feature  points  with  respect  to  the  unprocessed  image  or  the  convolved  images.  It 
is  usually  beneficial  to  use  masks  of  various  widths  for  sensitivity  to  features  at 
different  levels  of  resolution.  In  this  case,  the  translational  processing  described 
below  can  be  applied  independently  to  the  different  pairs  of  images  formed  by 
convolving  the  original  successive  images  with  the  different  masks.  Alternatively, 
as  was  done  above,  features  can  be  extracted  from  the  original,  unfiltered  image 
at  the  positions  where  features  were  determined  in  the  convolved  images,  though 
experience  with  large  masks  has  shown  that  this  approach  can  position  features 
significant  distances  from  their  apparent  position  in  the  original  image. 




Feature  Extraction  Using  Threshold  Contours 


Another simple operation to determine image contours is thresholding. The values
of the threshold can be determined in a variety of ways: using fixed increments,
finding peaks and valleys in the image intensity histogram, or using techniques
sensitive to image contrast across the contours produced by a particular threshold
[Kohl81, Wesk75].


The  images  in  figure  9a  and  9b  were  produced  from  a  solid  state  camera  held 
by  a  robot  manipulator  translating  toward  some  industrial  parts  lying  on  a  table. 
The  images  are  128x128  pixel  images  with  6  bits  of  intensity  resolution.  These  will 
be  referred  to  as  the  industrial  images.  Analysis  of  the  image  intensity  histogram, 
using the procedures described in [Kohl81], determined a clear break in the histogram
at an intensity level of 10 in the image. This corresponded to separation
of the dark background and the brighter objects in the scene. Figure 9c shows the
extracted contour and figure 9d the local maxima in the distinctiveness measure
of image features centered on pixels adjacent to the contour and of intensity value
greater than or equal to ten. Figure 9e shows the extracted feature points after
low curvature suppression using a threshold of -0.8 on the inner product
(corresponding to an angle of about 143 degrees).


Figure  9e.  High  Curvature  Points  along  Threshold  Contour. 


Determining  the  Axis  of  Translation 

The  procedure  for  determining  the  translational  axis  minimizes  an  error  measure 
which  describes  the  extent  of  feature  mismatch  along  the  image  displacement  paths 
determined by a hypothesized translational axis. Note that the image displacements
are determined simultaneously with the direction of motion. For example,
figure 10 shows an FOE determined by a potential translational axis and the corresponding
image displacement paths for some extracted features. Also shown is the
match profile for correlation of a particular feature along a segment of its displacement
path in the succeeding image. The adequacy of a potential translational axis
for  describing  the  motion  between  successive  images  is  measured  by  summing  the 
error  associated  with  the  best  match  for  each  of  the  features  along  their  respective 
image  displacement  paths. 


Figure 10. Translational Displacement Paths for a hypothesized FOE
and a match function on one feature (plot axes: match strength versus
displacement in pixels).


The  set  of  all  possible  translational  axes  describes  a  unit  sphere  called  the 
translational  direction  sphere.  For  reasons  discussed  below,  the  search  procedures 
are  defined  with  respect  to  this  sphere,  rather  than  the  image  plane  itself.  The 
error measure associates a point on the direction of translation sphere with a number
describing the quality of feature matches along the image displacement paths
determined  by  the  corresponding  hypothesized  translational  axis.  This  error  value 
is  computed  by  first  finding  the  best  match  for  each  feature  along  a  segment  of  its 
image  displacement  path  using  one  of  the  normalized  match  metrics  above.  Each 
of  these  values  is  then  subtracted  from  one,  and  all  the  resulting  values  are  added 
together  to  form  an  error  measure.  Thus,  for  a  set  of  N  features  in  an  initial  image, 
a  hypothesized  translational  axis,  and  use  of  one  of  the  match  metrics  above,  the 
error  measure  E  is 


$$E = \sum_{i=1}^{N} \left[ 1.0 - \mathrm{bestmatch}(i) \right]$$


where  bestmatch(i)  is  the  best  match  value  associated  with  feature  i  along  the 
appropriate  image  displacement  path. 
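
A minimal sketch of this error measure, for a hypothesized FOE given directly in image coordinates, is shown below. It is an illustration under stated assumptions rather than the procedure's actual code: positions are (row, column) pairs rounded to the nearest pixel, paths radiate away from the FOE only (an FOC would require walking toward the point instead), and the `match` function is an assumed sum-normalized absolute difference. All names are hypothetical.

```python
import math

def match(a, b):
    """Assumed similarity: 1 - sum|A-B| / (sum A + sum B); 1 for identical windows."""
    num = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = sum(map(sum, a)) + sum(map(sum, b))
    return 1.0 - num / den if den else 1.0

def window(image, r, c, half):
    """Square window centered at (r, c), or None if it leaves the image."""
    if r < half or c < half or r + half >= len(image) or c + half >= len(image[0]):
        return None
    return [row[c - half:c + half + 1] for row in image[r - half:r + half + 1]]

def best_match(feature, pos, foe, next_image, max_disp=10, half=1):
    """Best similarity of one feature along its path radiating away from the FOE."""
    dr, dc = pos[0] - foe[0], pos[1] - foe[1]
    norm = math.hypot(dr, dc)
    ur, uc = (dr / norm, dc / norm) if norm else (0.0, 0.0)
    best = 0.0
    for d in range(max_disp + 1):          # displacements in 1-pixel increments
        r, c = round(pos[0] + d * ur), round(pos[1] + d * uc)
        w = window(next_image, r, c, half)
        if w is not None:
            best = max(best, match(feature, w))
    return best

def error_measure(features, next_image, foe, max_disp=10, half=1):
    """E = sum over features of (1 - best match) for one hypothesized FOE."""
    return sum(1.0 - best_match(f, pos, foe, next_image, max_disp, half)
               for pos, f in features)
```

Here `features` is a list of `((row, col), window)` pairs taken from the first image. A hypothesized FOE whose displacement paths pass through the true positions of the features in the second image gives an error near zero; other hypotheses accumulate mismatch.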


The  error  measure  utilizes  the  different  correlation  norms  described  above  and 
different interpolation processes for determining positions along an image displacement
path. The choices among these generally involve a trade-off between the speed
of  evaluating  the  error  measure  and  the  precision  with  which  the  translational  axis 
can  be  determined. 


The interpolation process approximates feature values along the image displacement
path from one image onto another. Depending on the accuracy required, positions
along the image displacement path can be approximated roughly by setting
the coordinates of the feature’s position to the nearest integer value, or more accurately
by performing a bilinear subpixel interpolation of the feature at each of a
set  of  selected  positions  along  the  image  displacement  path.  The  basic  trade-off  is 
between  speed  and  accuracy,  with  subpixel  interpolation  being  more  expensive. 
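
The two interpolation choices can be contrasted in a short Python sketch. The function names and the clamping at the image border are assumptions made for the example; images are plain lists of rows.

```python
import math

def nearest(image, r, c):
    """Rough form: read the pixel at the nearest integer position."""
    return image[int(round(r))][int(round(c))]

def bilinear(image, r, c):
    """More accurate form: bilinear interpolation at a real-valued (row, col)."""
    r0, c0 = int(math.floor(r)), int(math.floor(c))
    r1 = min(r0 + 1, len(image) - 1)      # clamp at the last row
    c1 = min(c0 + 1, len(image[0]) - 1)   # clamp at the last column
    fr, fc = r - r0, c - c0
    top = (1 - fc) * image[r0][c0] + fc * image[r0][c1]
    bottom = (1 - fc) * image[r1][c0] + fc * image[r1][c1]
    return (1 - fr) * top + fr * bottom

def sample_window(image, r, c, half, sampler=bilinear):
    """Resample a (2*half+1)-square feature window centered at a subpixel position."""
    return [[sampler(image, r + dr, c + dc) for dc in range(-half, half + 1)]
            for dr in range(-half, half + 1)]
```

The bilinear form costs four pixel reads and several multiplications per sample instead of one read, which is the speed penalty referred to above.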

The  error  measure  was  computed  in  two  forms  in  the  experiments  below:  a  fast 
evaluation  form  and  a  precise  evaluation  form.  The  fast  form  uses  the  absolute 
value  norm  and  the  nearest  integer  approximation  to  determine  feature  position 
along  the  image  displacement  paths.  The  fast  form  is  useful  for  evaluating  image 
sequences  with  several  extracted  features  to  determine  the  rough  position  of  the 
global  minimum.  However,  the  fast  form  may  not  be  adequate  for  fine  determination 
of  the  translational  axis  because  of  the  nearest  integer  approximation  for  feature 
position. 

The precise form of evaluation uses the Moravec norm and bilinear interpolation.
It  has  been  found  to  vary  smoothly  with  respect  to  small  changes  in  the  position  of 
a  translational  axis. 


Utility  of  the  Direction  of  Translation  Sphere 


There  are  significant  advantages  in  defining  the  error  measure  with  respect  to 
a  unit  sphere  instead  of  the  potential  positions  of  FOEs  and  FOCs  in  the  image 
plane.  The  sphere  is  a  bounded  surface  which  makes  uniform  global  sampling  of 
the  error  measure  feasible.  In  contrast,  when  the  image  plane  is  used  directly,  the 
resolution  in  the  position  of  the  translational  axis  varies.  For  example,  the  FOEs 
determined  by  translational  axes  separated  by  very  small  angles  will  be  separated 
by  larger  and  larger  distances  in  the  image  plane  as  FOEs  are  placed  further  from 
the  visible  image.  The  effect  of  using  the  image  plane  on  the  error  measure  is  a 
loss  of  resolution  with  large  flat  areas  surrounding  FOEs  that  are  distant  from  the 
visible  portions  of  the  image. 

Finally,  special  criteria  must  be  used  to  distinguish  between  FOEs  and  FOCs 
if  the  error  measure  is  defined  relative  to  the  image  plane.  Roughly  parallel  image 
displacements  could  correspond  to  an  FOE  off  to  one  side  or  an  FOC  off  to  the 
opposite side of the image plane. On the direction of translation sphere, the corresponding
translational axes would be close, while on the plane they are widely
separated  at  plus  and  minus  infinity. 
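
The loss of resolution in the image plane can be made concrete with a small numerical sketch. It assumes a pinhole camera with the image plane at distance `focal_length` along the Z axis, so that a translational axis (X, Y, Z) meets the plane at (f X/Z, f Y/Z); this projection model and the function names are assumptions of the example.

```python
import math

def foe_position(axis, focal_length=1.0):
    """Image-plane intersection of a translational axis, or None if parallel to the plane."""
    x, y, z = axis
    if z == 0:
        return None  # the intersection point lies at infinity
    return (focal_length * x / z, focal_length * y / z)

def foe_spacing(theta, dtheta=0.1):
    """Image-plane distance between FOEs of two axes dtheta apart, theta off the optical axis."""
    a = foe_position((math.sin(theta), 0.0, math.cos(theta)))
    b = foe_position((math.sin(theta + dtheta), 0.0, math.cos(theta + dtheta)))
    return b[0] - a[0]
```

With these assumptions, two axes 0.1 radians apart near the optical axis give FOEs about 0.1 focal lengths apart, while the same angular separation at 1.4 radians off axis separates the FOEs by more than 8 focal lengths. Uniform sampling on the sphere avoids this distortion.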


Search  Organization 

The  search  process  used  here  consists  of  two  phases:  An  initial  global  sampling  of 
the error measure to determine its rough shape and then a local search to determine
a minimum. The local search begins at the position where the minimum value
was  determined  by  the  global  sampling.  The  local  search  is  a  gradient  descent 
procedure  using  a  diminishing  step-size.  That  is,  it  begins  with  an  initial  fixed 
step  size  and  determines  a  local  minimum  using  it.  The  step-size  is  then  reduced 
and  the  procedure  repeated  until  the  step-size  is  at  the  desired  resolution  for  the 
determination of the translational axis. In the experiments below the initial step-size
was set to 0.1 and then reduced successively to 0.025 and 0.005 radians.
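
The local search with a diminishing step-size might be sketched as follows. The text does not specify how the descent direction is chosen, so this example simply probes the 8 neighboring positions at the current step-size and moves to the best one; that choice, and the names used, are assumptions.

```python
def local_search(error, start, step_sizes=(0.1, 0.025, 0.005)):
    """Descend on error(phi1, phi2) from start, refining with smaller steps."""
    best, best_err = start, error(*start)
    for step in step_sizes:
        while True:
            p1, p2 = best
            candidates = [(p1 + d1, p2 + d2)
                          for d1 in (-step, 0.0, step)
                          for d2 in (-step, 0.0, step)
                          if (d1, d2) != (0.0, 0.0)]
            err, cand = min((error(*c), c) for c in candidates)
            if err >= best_err:
                break          # local minimum at this step-size; refine
            best, best_err = cand, err
    return best, best_err
```

In use, `error` would evaluate the error measure for the translational axis at the given angles, and `start` would be the position of the minimum found by the global sampling.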

As  will  be  seen  in  the  following  experiments,  the  error  measure  is  smooth,  with 
a  single  minimum  in  a  large  neighborhood  around  the  correct  translational  axis. 
Thus,  the  global  sampling  can  be  quite  sparse  or  the  initial  step  size  of  the  local 
search  quite  large. 


Experiments 

The  following  experiments  were  performed  using  the  roadsign  and  industrial 
image  sequences.  They  represent  a  wide  range  of  situations.  The  first  experiment 
involves  determining  the  translational  axis  from  the  industrial  image  sequence  using 
the  features  indicated  in  figure  9e.  In  this  sequence  the  translational  axis  intersects 
the  image  plane  in  a  visible  portion  of  the  image.  The  second  experiment  involves 
processing  the  industrial  image  sequence  using  a  smaller  number  of  features.  In 
the  third  experiment  the  roadsign  image  sequence  is  processed  using  the  features 
extracted  at  the  positions  indicated  in  figure  8e.  Here,  the  intersection  of  the 
translational  axis  and  the  image  plane  is  not  in  the  visible  portion  of  the  image. 


The  fourth  experiment  involves  processing  the  roadsign  image  sequence,  but  using 
the  features  extracted  prior  to  low-curvature  suppression.  This  has  the  effect  of 
introducing  weak  and  spurious  features  into  the  error  measure  computation.  The 
fifth  experiment  involves  processing  the  roadsign  images  using  features  extracted 
from  a  small  area  of  the  initial  image. 

In all of the experiments, the maximal displacement along an image displacement
path was set to 10 pixels. Displacements were in increments of 1 pixel along
the image displacement paths. Features were 7x7 pixel arrays centered at the positions
indicated in the figures.

We use a 2-D polar coordinate system to describe the points on the direction
of translation sphere over which the error measure is evaluated. The axes of translation
are unit vectors based at the origin of the camera coordinate system and are
described by two angles $(\phi_1, \phi_2)$ (figure 11). For an axis of translation, V, based at
the origin, $\phi_1$ is the angle between the (0,1,0) vector and the edge determined by
the intersection of the YZ plane and the plane determined by the X axis and V.
$\phi_1$ thus specifies one of the pencil of planes containing the X axis. $\phi_2$ is then used
to express V as a vector in the specified plane. $\phi_2$ is the angle between (-1,0,0)
and V. Note that for all angles a and b, (a,0) = (b,0) and (a,$\pi$) = (b,$\pi$), which
corresponds to points lying along the X axis.
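
The angle definitions above imply a conversion to camera coordinates of the form X = -cos(phi2), Y = sin(phi2) cos(phi1), Z = sin(phi2) sin(phi1). The following Python sketch uses that reading; it reproduces the coordinate values listed for the local search in Table 2c (for example, angles (1.6708, 1.3708) give approximately (-0.19867, -0.09785, 0.97517)). The function names are hypothetical.

```python
import math

def axis_from_angles(phi1, phi2):
    """Unit translational axis (X, Y, Z) for sphere coordinates (phi1, phi2)."""
    return (-math.cos(phi2),
            math.sin(phi2) * math.cos(phi1),
            math.sin(phi2) * math.sin(phi1))

def angles_from_axis(x, y, z):
    """Inverse mapping for a unit axis; phi1 is arbitrary when the axis lies along X."""
    phi2 = math.acos(max(-1.0, min(1.0, -x)))
    phi1 = math.atan2(z, y) % (2 * math.pi)
    return phi1, phi2
```

At phi2 = 0 or phi2 = pi the sine factor vanishes, so every phi1 maps to the same point on the X axis, which is the degeneracy noted in the text.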


Figure  11.  Coordinate  System  for  Describing  Translational  Axes 

For  each  experiment,  the  results  of  processing  are  contained  in  3  tables.  The 
first  two  (tables  a  and  b)  indicate  the  values  of  the  error  measure  during  the  global 
sampling of points using a fixed angular increment (equal to $\pi/10$ or 18 degrees) in
$(\phi_1, \phi_2)$ coordinates on the direction of translation sphere. The first of these tables
corresponds  to  translational  axes  which  intersect  the  image  plane  at  FOEs.  The 
second  basically  corresponds  to  those  which  intersect  the  image  plane  at  FOCs. 
Each  of  these  tables  is  also  presented  as  an  intensity  plot  and  a  contour  plot.  In  the 
intensity  plot,  error  is  proportional  to  intensity  so  darker  areas  imply  lower  values 
of error. In the contour plots, the positions of local minima are marked with a "-"
and the local maxima are marked with a "+". Certain distortions appear in these
figures because they result from mapping the unit sphere onto planes. Thus values
near  the  right  and  left  hand  sides  of  the  figures  are  actually  closer  to  each  other 
on  the  unit  sphere  than  those  points  nearer  the  center.  Additionally,  the  positions 
on  the  extreme  left-hand  side  of  the  figures  actually  correspond  to  the  same  point 
on  the  direction  of  translation  sphere  which  flattens  the  error  surface  plots  at  these 
positions. 
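
The global sampling can be sketched as a grid over the two angles. This example assumes that axes with positive Z component intersect the image plane at FOEs and those with negative Z at FOCs, and it keeps axes parallel to the image plane in a separate group; how the original tables treat those boundary cases is not stated, so the grouping and all names here are assumptions.

```python
import math

def axis_from_angles(phi1, phi2):
    """Unit translational axis (X, Y, Z) for sphere coordinates (phi1, phi2)."""
    return (-math.cos(phi2),
            math.sin(phi2) * math.cos(phi1),
            math.sin(phi2) * math.sin(phi1))

def global_samples(increment=math.pi / 10):
    """Grid of (phi1, phi2) samples, grouped by how the axis meets the image plane."""
    n = round(math.pi / increment)
    groups = {"foe": [], "foc": [], "parallel": []}
    for i in range(2 * n):            # phi1 covers [0, 2*pi)
        for j in range(n + 1):        # phi2 covers [0, pi]
            phi1, phi2 = i * increment, j * increment
            z = axis_from_angles(phi1, phi2)[2]
            key = "parallel" if abs(z) < 1e-9 else ("foe" if z > 0 else "foc")
            groups[key].append((phi1, phi2))
    return groups

def global_minimum(error, samples):
    """Sample position with the smallest error value."""
    return min(samples, key=lambda s: error(*s))
```

With an increment of pi/10 this yields 81 sample positions in each of the FOE and FOC groups. The minimum over all sampled positions would then seed the local search.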

The  third  table  (table  c)  shows  the  minimal  value  determined  by  the  global 
sampling  process  that  is  used  to  initiate  the  local  search,  and  the  successive  values 
of  the  error  measure  determined  during  the  local  search.  In  this  table,  the  position 
of  the  translational  axis  is  referred  to  in  terms  of  ( X ,  Y,  Z)  camera  coordinates, 
in addition to $(\phi_1, \phi_2)$ coordinates, so that translational axes computed under
different situations can be compared.


Industrial Images

The procedure was applied to the industrial images using the features extracted
at the positions shown in figure 9e. Tables 1a and 1b show the global
sampling of the error measure using the fast form of evaluation. Note the minima
at $(\phi_1, \phi_2) = (5\pi/10, 4\pi/10) = (1.571, 1.257)$ radians. Table 1c shows the successive
values of the local search using the precise form of evaluation. The determined
translational axis is (-0.139, -0.099, 0.985). The image displacements determined
for these features are shown in figure 12.


Figure 12. Industrial Image Displacements.

Table 1c. Industrial Image Local Search Values


*  Denotes  this  error  value  was  computed  using  the  fast  evaluation  form.  The 
other  values  were  computed  using  the  precise  evaluation  form. 


Industrial  Images  with  Selected  Features 


The  procedure  was  again  applied  to  the  industrial  image  sequence  but  using 
features  which  were  selected  by  hand.  The  positions  of  these  8  features  are  shown 
in  figure  14. 

Tables 2a and 2b show the global sampling of the error measure using the
precise form of evaluation. Note the minima at $(\phi_1, \phi_2) = (5\pi/10, 5\pi/10)$. Table 2c
shows the successive positions determined by the local search. The translational
axis was determined to be (-0.154, -0.079, 0.985). This corresponds to an angular
difference  of  0.025  radians  (1.45  degrees)  with  respect  to  the  axis  determined  in 
experiment  1. 
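
The angular difference between two determined axes can be checked with a few lines of Python. The function name is hypothetical, and the atan2 form is a choice made here because it stays accurate for the very small angles involved.

```python
import math

def angle_between(a, b):
    """Angle in radians between two 3-D vectors, stable for small angles."""
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = sum(x * y for x, y in zip(a, b))
    return math.atan2(math.sqrt(sum(c * c for c in cross)), dot)

# Axes reported for experiments 1 and 2.
difference = angle_between((-0.139, -0.099, 0.985), (-0.154, -0.079, 0.985))
```

For the two industrial-image axes this gives approximately 0.025 radians, in agreement with the figure quoted above.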


Figure 14. Selected Features from Industrial Image 1.

Table 2a. Industrial Image Selected Feature Error Values

Table 2b. Industrial Image Selected Feature Error Values

Stepsize   $\phi_1$   $\phi_2$    X          Y          Z         Error
           1.6708?

Table  2c.  Industrial  Image  Selected  Feature  Local  Search  Values 


Figure 15a. Intensity plot of Table 2a.


Figure 15b. Intensity plot of Table 2b


Roadsign Image Sequence

The  procedure  was  applied  to  the  roadsign  image  sequence  using  the  features 
extracted  at  the  positions  indicated  in  figure  8e.  Tables  3a  and  3b  show  the  global 
sampling  of  the  error  measure  using  the  fast  form  of  evaluation.  Note  the  minima 
at $(\phi_1, \phi_2) = (8\pi/10, 2\pi/10)$. Table 3c shows the successive values of the local search
using  the  precise  form  of  evaluation  for  the  error  measure.  The  translational  axis 
determined  by  this  process  is  (-0.837,  -0.420,  0.349).  The  image  displacements  for 
the  feature  points  shown  in  figure  8e  that  are  associated  with  this  translational  axis 
are  shown  in  figure  17. 

Given the direction of translation and image displacements, the relative environmental
depths of image points can be recovered by the simple relation in equation
ten from chapter III. When image displacements are small, the inferred depth values
can be quite erratic due to sensitivity to small numbers in the denominator in the
left hand side of this equation. For this reason it is necessary to use image pairs
for which large displacements can be determined. One way to do this for image
sequences which are related by successive sensor translations is to track the FOE
from a given image with respect to successively later images. This was done with four
successive images from the roadsign sequence beginning with roadsign images 1 and
2 and using the features from image 1 at the positions in figure 8e. The position
of the translational axis determined from images I(1) and I(t+1) was used as the
initial value in the local search for determining the translational axis for images
I(1) and I(t+2), where t = 1, 2 in this example. The displacements of all features
along the contour in figure 8c were determined along the image displacement paths
determined by the FOE found for images I(1) and I(4). To compute depth along
the contours, 5x5 windows, centered at each contour point, were matched along the
image displacement paths and the displacement corresponding to the best match
was determined. The resulting relative depth map is shown in figure 18 where
depth is encoded by intensity (more distant things are brighter).
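
The sensitivity of depth to small displacements can be seen in a short sketch. It does not reproduce the chapter III equation, which is not shown here; instead it assumes the standard pinhole relation for pure translation, in which a point at distance r0 from the FOE in the first image and r1 in the later image has depth r0 / (r1 - r0) at the later time, measured in units of the camera displacement along the Z axis. The function name and the `eps` guard are hypothetical.

```python
import math

def time_until_contact(foe, p0, p1, eps=1e-9):
    """Relative depth at the later image, in units of the camera's Z displacement."""
    r0 = math.hypot(p0[0] - foe[0], p0[1] - foe[1])
    r1 = math.hypot(p1[0] - foe[0], p1[1] - foe[1])
    if abs(r1 - r0) < eps:
        return math.inf   # no measurable displacement: depth is unconstrained
    return r0 / (r1 - r0)
```

For example, a point at X = 1, Z = 10 viewed with unit focal length projects 0.1 from the FOE; after the camera advances 2 units it projects at 0.125, and the sketch returns 4, the remaining depth of 8 divided by the displacement of 2. When r1 - r0 is a fraction of a pixel, small matching errors change the result greatly, which is why image pairs with larger displacements are preferred.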

The roadsign sequence is particularly nice for presenting depth processing results
because  the  three  environmental  objects  in  the  images  are  at  three  distinct  depth 
intervals.  This  is  shown  in  figure  19  by  the  three  distinct  clusters  in  the  histogram 
of  the  depth  values  calculated  for  the  points  along  the  contour.  The  units  in  the 
histogram  are  cumulative  time-until-contact  values.  That  is,  the  depth  is  given  in 
units of the displacement of the camera from I(1) to I(4) along the Z-axis. From
left  to  right,  the  first  peak  corresponds  to  the  sign,  the  second  to  the  pole,  and  the 
third  to  the  trees.  As  can  be  seen,  there  is  a  wide  range  of  depths  associated  with 
the  trees.  Mapping  these  clusters  back  onto  contour  points  from  figure  8c  yields  the 
distinct  objects:  the  boundary  shown  in  figure  20a  (the  sign),  the  boundary  shown 
in  figure  20b  (the  pole),  the  boundary  segment  shown  in  figure  20c  (the  trees). 
Points near the image boundary of I(1) were ignored because the processing did not
take into account occlusion effects along the image boundaries.

Table 3a. Roadsign Image Error Values

Table 3b. Roadsign Image Error Values

Table 3c. Roadsign Image Local Search Values.


* Denotes this error value was computed using the fast evaluation form. The
other values were computed using the precise evaluation form.

The procedure was applied to the roadsign image sequence using the features
which were extracted prior to low-curvature suppression. The positions of these
features are shown in figure 8d. This has the effect of including several weak and
false features in the evaluation of the error measure.

Tables 4a and 4b show the values of the global sampling of the error measure
using the fast form of evaluation. Note the minima at $(\phi_1, \phi_2) = (8\pi/10, 2\pi/10)$. Table
4c shows the successive values of the local search. The determined translational
axis was (-0.829, -0.423, 0.366). This corresponds to an angle of 0.019 radians
(1.068 degrees) with respect to the axis determined in experiment 3.

Table 4a. Roadsign Redundant Feature Error Values.

Table 4b. Roadsign Redundant Feature Error Values

Stepsize   $\phi_1$   $\phi_2$    X          Y          Z         Error
           2.5133     0.62832    -0.80902   -0.47554

Table 4c. Roadsign Redundant Feature Local Search Values

* Denotes this error value was computed using the fast form of evaluation. All
other values were computed using the precise form.


Roadsign Subimage

This  experiment  was  conducted  to  test  the  accuracy  of  the  algorithm  when 
applied  to  a  very  small  area  of  the  visual  field.  The  procedure  was  applied  to  the 
roadsign  image  sequence  with  features  restricted  to  the  rectangular  area  shown  in 
figure  22  corresponding  to  texture  in  the  distant  trees. 

Tables 5a and 5b show the values of the global sampling of the error measure
using the precise form of evaluation. Note the minima at $(\phi_1, \phi_2) = (7\pi/10, 2\pi/10)$.
Table 5c shows the successive values determined by the local search. The translational
axis is determined to be (-0.843, -0.429, 0.325). This corresponds to angles
of 0.027 radians (1.53 degrees) and 0.044 radians (2.516 degrees), with respect to the
translational axes determined in experiments 3 and 4 respectively.

Table 5a. Roadsign Subimage Error Values

Table 5b. Roadsign Subimage Error Values

X          Y          Z         Error
-0.80902   -0.34549   0.47553   0.059910
           -0.42928             0.059269

Table  5c.  Roadsign  Subimage  Local  Search  Values 


Discussion 


The experiments presented here, as well as others, have shown that the procedure
is robust in several important ways. It is resilient with respect to weak and
false features and is not dependent on identical features being extracted in successive
images prior to matching. It can use a small number of features positioned
across  an  image  surface,  or  a  small  number  of  features  from  a  limited  area  of  the 
image. 

In  the  remainder  of  this  chapter,  we  discuss  the  feature  extraction  process  and 
how  it  may  be  made  more  efficient,  and  the  general  behavior  of  the  error  measure.  In 
the  next  chapter  we  explore  several  potential  extensions  of  the  translational  motion 
procedure. 


Feature  Extraction 


Since  the  procedure’s  performance  does  not  degrade  severely  due  to  the  occur¬ 
rence  of  poor  features,  the  type  of  feature  extraction  used  is  not  critical.  Nonethe¬ 
less,  the  feature  extraction  process  developed  here  could  be  extended  in  many  ways. 
A  simple  one  is  to  constrain  the  extraction  of  interesting  points  to  positions  where 
image  contrast  exceeds  some  minimal  value.  Also,  other  types  of  contour  extraction 
can  be  used.  For  example,  contours  can  also  be  determined  by  local  application  of 
histogram guided thresholding and segmentation. This resolves some of the problems
associated with using a single threshold determined for image subparts with
significantly different brightnesses [Kohl81].

A significant question concerns the speed at which features are extracted. Locality
of processing leads to the most efficient computation in array processing
architectures. In the procedure here, the technique of contour walking to determine
curvature is significantly non-local. Since the algorithm is robust with respect to
weak features, the use of less costly methods for extraction of possibly weaker
features may be acceptable. It may be possible to directly determine points of high
curvature by using corner finders [Kitc80, Zuni83].

Another alternative to contour walking is simply to use a threshold on the
distinctiveness measures, with or without the determination of local maxima in
distinctiveness. Examination of the local maxima along the telephone pole in figure
2c reveals that these are local maxima with very small distinctiveness measures.
This has been observed in general.

An additional speed-up can be obtained when features are selected from contours
determined by segmentation procedures (such as thresholding or zero-crossing
extraction) which produce binary images where pixel values may be represented by
1 or -1. In this case there is no need to normalize the correlation measure used
to determine distinctiveness because each image subarea of equal size has identical
constant image energy [Duda73]. Thus, the normalizing terms in the correlation
measures become constants and the arithmetic operations are restricted to products
or additions over the set {1, -1}. When the distinctiveness measures are determined
along the contours of binary images followed by a threshold on distinctiveness and
local maxima extraction, very rapid rates of feature extraction can be achieved
in the particular architectures we have explored, on the order of a fraction of a
millisecond [Lawt84].
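
The observation about constant image energy can be checked with a small sketch. It assumes a normalized correlation without mean subtraction; the function names and the random 5x5 windows are illustrative only and are not taken from the original implementation.

```python
import numpy as np

def normalized_correlation(a, b):
    """Normalized cross-correlation (no mean subtraction) of two windows."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def binary_correlation(a, b):
    """Correlation for windows with values in {1, -1}.

    Every such window has energy equal to its pixel count, so the
    normalizing term reduces to a constant division.
    """
    return float(np.sum(a * b)) / a.size

rng = np.random.default_rng(0)
w1 = rng.choice([-1, 1], size=(5, 5))
w2 = rng.choice([-1, 1], size=(5, 5))
print(normalized_correlation(w1, w2), binary_correlation(w1, w2))
```

For windows of this kind the two functions agree, which is why the normalization step can be dropped.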

The binary image in figure 24 was determined by thresholding at zero the result of
applying the ∇²G mask used above to the initial roadsign image. Figure 25 shows the
interesting points extracted from the binary image in figure 24 using a threshold on
distinctiveness set to 0.1 followed by local maxima extraction. The results are
reasonable, although mistakes can occur if the neighborhoods over which local maxima
are computed contain points of high curvature from distinct regions. This could
be remedied by restricting the calculation of distinctiveness to points along
contours of the same region (which would then require the determination of region
labels via a connected components algorithm).


Figure  24.  Binary  Roadsign  Image. 


Figure  25.  Interesting  Points  along  Contours. 

It would also be useful to incorporate information determined from the extraction
of the translational axis to isolate false features. This could involve removing
from the error measure those features which have weak matches once a translational
axis has been determined, and re-evaluating to refine the FOE. Such a filtering
process would be particularly helpful when the total minimum error was not sufficiently
low, thereby casting doubt on the correctness or accuracy of the solution.
Alternatively, the depth inferences could be used to isolate the positions of potential
false features by noting discontinuities in depth along an extracted contour. Such
features tend to be associated with vertices generated by surface occlusion.
Extracted features at or near such positions could be removed from the re-evaluation
of the error measure.

Another type of feature which can affect the evaluation of the error measure
is one near an FOE or FOC which is contained in a visible portion of the image.
Such features tend to move very small amounts along their image displacement
paths and hence require fine interpolation to determine their best matches. The
depth inferences associated with such points tend to be highly erratic since their
use in the inference relation from chapter III involves dividing a small number by
another small number.


Properties  of  the  Error  Measure 


In  the  experiments  presented,  the  error  measure  has  a  distinct  global  minimum 
at  the  point  on  the  unit  sphere  corresponding  to  the  correct  translational  axis.  It 
is  generally  expected  to  have  such  behavior  because  it  is  very  unlikely  that  trans¬ 
lational  axes  that  are  far  from  the  correct  position  will  define  image  displacement 
paths  that  simultaneously  allow  good  matches  for  many  features.  Thus,  competing 
candidates  for  the  global  minimum  are  not  expected  to  be  widely  separated.  This 
reasoning implies strong unimodality and smoothness of the error measure over a
large  neighborhood  and  this  has  been  confirmed  empirically.  Therefore,  the  opti¬ 
mization  procedure  used  here  could  be  replaced  by  other  techniques  which  generally 
have  faster  convergence. 

The  error  measure  is  affected  by  both  non-distinctive  and  false  features.  Non- 
distinctive  features  will  match  well  for  many  different  translational  axes.  Large 
numbers  of  these  weak  features  will  flatten  the  response  of  the  error  measure.  False 
features  will  also  distort  the  error  measure  since  they  will  often  have  their  best 
matches  with  incorrect  translational  axes. 

The effects of these poor features should be compensated for by the agreement of
good features. Every one of the good features will tend to have a bad match for
an incorrect translational axis, and their unanimity is expected to override the lack
of discrimination of weak features and the random quality of the matches of false
features. However, there is a limit on the percentage of weak and false features that
can be tolerated before the algorithm degrades. This limit has not been explored, but
our experience suggests that it may be quite high, with perhaps as many as 50 percent
of the features being ineffective.


CHAPTER  V 


EXTENSIONS  TO  TRANSLATIONAL  MOTION  PROCESSING 

Introduction 

In  this  chapter  we  discuss  several  extensions  to  the  translational  motion  proce¬ 
dure.  We  begin  by  formulating  the  computation  hierarchically.  This  significantly 
increases  the  computational  speed  of  the  procedure  and  the  extent  of  image  dis¬ 
placements  that  can  be  processed.  We  then  show  how  to  process  the  blur  paths  of 
nearby  textured  surfaces  when  prolonged  exposures  are  used  during  translational 
motion.  We  note  the  implications  of  this  case,  both  for  processing  computed  trans¬ 
lational  displacement  fields,  and  for  using  blur  to  determine  image  displacements 
in  general.  The  third  extension  to  our  algorithm  considers  different  approaches  for 
processing  image  sequences  containing  multiple,  independently  translating  objects. 
One  of  these  is  based  upon  generalized  Hough  techniques  to  decompose  the  error 
measure  response  into  the  effects  of  the  different  objects.  The  others  are  based  upon 
local  application  of  the  procedure  to  image  subareas  determined  by  segmentation 
or  image  subdivision.  Finally,  we  consider  the  use  of  translational  motion  process¬ 
ing  for  autonomous  vehicle  navigation  by  using  devices  to  stabilize  the  sensor  or  to 
obtain  the  rotational  parameters  directly. 


A  basic  paradigm  in  computer  vision  is  the  use  of  hierarchical  representations 
and processes [Burt82, Glaz83a, Glaz83b, Hans80, Tani80, Uhr78]. This allows different
magnitudes and scales of image events to be handled uniformly. Additionally,
the  consistent  agreement  among  hierarchically  organized  processes  is  a  basic  control 
strategy  for  a  wide  range  of  high  and  low  level  interpretation  tasks.  Hierarchical 
processing  can  produce  significant  computational  reductions,  wherein  results  from 
processing  performed  rapidly  at  lower  resolutions  of  image  information  are  used  to 
direct  and  constrain  more  detailed  and  extensive  processing  of  higher  resolution 
image  information. 

The  processing  of  translational  motion  can  be  developed  in  a  hierarchical  fashion 
with  the  primary  benefits  being  increased  speed  and  the  ability  to  deal  with  larger 
image displacements. This requires specifying the hierarchical representations of
the successive images and the extracted features, and specifying how processing at
different levels of image resolution is related.

Hierarchical  Representation  of  Images  and  Features 


In the initial work described here, images have been represented in the VISIONS
image operating cone structure [Hans80]. This consists of a sequence of images
$I_0, I_1, I_2, \ldots, I_n$, where the successive sizes of the images are
$1 \times 1, 2 \times 2, 4 \times 4, \ldots, 2^n \times 2^n$. The value $n$ is the
level of the image in the cone. Each pixel in the $i$-th image, except for the first
and last images, has a connected neighborhood of immediate descendants in the
$(i+1)$-th image and a parent in the $(i-1)$-th image. The size and shape of the
immediate descendant neighborhood can be arbitrary and the immediate descendant
neighborhoods of adjacent pixels may or may not overlap.

There are several ways to reduce the resolution of an image in the VISIONS
cone [Hans80] and other pyramid architectures [Burt82, Tani80, Uhr78]. These
techniques involve either smoothing the image with some operator and then sampling
at a reduced interval, or using a reduction operator which is some function of the
pixels in the immediate descendant neighborhoods. The results of reducing image
resolution by averaging with Gaussian masks over 5x5 pixel immediate descendant
neighborhoods at successive levels of roadsign image 1 are shown in figures 26a-d.
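
A minimal sketch of one such reduction is given below. The binomial weights, the reflected border handling, the factor-of-two sampling, and the function names are assumptions made for illustration; the masks actually used in the VISIONS cone are not specified here.

```python
import numpy as np

# Assumed 5-tap Gaussian-like weights (they sum to 1).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def reduce_level(image):
    """Smooth with a separable 5x5 mask, then keep every second row and column."""
    padded = np.pad(image.astype(float), 2, mode="reflect")
    # Filter along rows, then along columns; "valid" removes the padding again.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, KERNEL, mode="valid"), 1, padded)
    both = np.apply_along_axis(
        lambda c: np.convolve(c, KERNEL, mode="valid"), 0, rows)
    return both[::2, ::2]

def build_cone(image, coarsest=16):
    """List of images from full resolution down to coarsest x coarsest."""
    levels = [image.astype(float)]
    while levels[-1].shape[0] > coarsest:
        levels.append(reduce_level(levels[-1]))
    return levels

cone = build_cone(np.ones((128, 128)))
print([level.shape for level in cone])
```

Starting from a 128 x 128 image, this produces the four resolutions 128, 64, 32, and 16 used in the figures of this section.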

The positions of extracted features can also be represented in the cone structure
at different levels of resolution. There are several alternatives for doing this. First,
it may not be necessary to extract features at all: the procedure could simply be
applied uniformly, treating every position as a feature and relying on the increased
speed of hierarchical computation or potential architectures to make this possible.
A second approach is to apply the feature extraction process to each image at each
level of image resolution. A third technique is to extract features in the highest
resolution image and then treat the ancestors of these in the lower resolution images
as features. In this case, the immediate descendant neighborhoods should not overlap
(so each feature has unique ancestors). A feature is then positioned at a parent pixel
if any of its descendants are at positions where a feature has been extracted. These
approaches may interact in interesting ways if the strength of a feature is expressed
as a function of its own distinctiveness and that of its descendants. We have thus far
utilized the approach based upon extracting features at the highest image resolution,
though general problems with this should be noted. Features that are separated at
higher resolutions become adjacent at lower resolutions. Thus, the inferred features
at the lower resolutions may not be meaningful, especially since the information is
not uniform across the range of spatial frequencies represented in the different image
resolutions. The benefit of this technique is that there are explicit and unique links
between features at different image resolutions so that displacements determined at
coarse levels can be used to initialize the estimates of displacements at finer levels.
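
The rule for positioning features at parent pixels can be sketched as follows, assuming non-overlapping 2x2 immediate descendant neighborhoods and a boolean feature mask. The function name and the mask representation are illustrative assumptions.

```python
import numpy as np

def parent_features(mask):
    """Mark a parent pixel as a feature if any of its 2x2 descendants is one.

    `mask` is a boolean array with even height and width; the result has
    half the size in each dimension.
    """
    h, w = mask.shape
    blocks = mask.reshape(h // 2, 2, w // 2, 2)
    return blocks.any(axis=(1, 3))

fine = np.zeros((4, 4), dtype=bool)
fine[3, 0] = True                 # one extracted feature at the finest level
coarse = parent_features(fine)
print(coarse)
```

Applying the function repeatedly yields the feature positions at each coarser level, with each feature having exactly one ancestor per level.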

Figures  27a-d  show  the  features  resulting  for  roadsign  image  1  at  different  levels 
of  resolution  by  using  the  feature  positions  determined  from  the  highest  level  of 
image  resolution  (figure  8e  in  chapter  IV)  at  the  corresponding  positions  in  the 
lower  resolution  images. 


Figure  27a.  128  x  128  Resolution. 


Figure  27b.  64  x  64  Resolution 


Figure  27c.  32  x  32  Resolution.  Figure  27d.  16  x  16  Resolution 

Translational Processing at Different Resolutions

The  translational  processing  can  be  applied  to  successive  images  at  any  level  of 
resolution  for  which  features  have  been  extracted  from  the  initial  image.  The  basic 
questions  concern  how  processing  at  one  level  affects  processing  at  another  level.  In 
particular,  how  do  processing  results  at  a  coarser  level  of  resolution  constrain  the 
processing  at  finer  levels  of  resolution?  At  what  level  in  the  cone  can  processing  be 
meaningfully  initialized?  How  do  the  various  parameters  involving  feature  window 
size,  displacement  resolution  along  a  flow  path,  and  resolution  of  the  optimization 
procedure  change  at  different  levels  of  the  cone? 

Let us present our first effort to deal with these issues. For a given pair of images
at level $i$ in the cones formed from successive images, the translational error
measure will be minimized for the set of features determined at level $i$ (using the
ancestors of features determined from the highest resolution version of the initial
image). The position of the minimum error in the translational axis at level $i$ is
then used to constrain the optimization of the error function for the images and
feature positions at the $i+1$ level in the cone. In addition to constraints on the
position of the error function minimum, processing higher in the cone constrains
the evaluation of the potential displacements of extracted features along their
displacement paths. Figure 28 shows flow paths at different levels of resolution. For
each displacement determined at level $i$, only three positions have to be evaluated
at level $i+1$. Thus, not only is the minimum of the error function passed on,
but also the displacements of parent features, which are then used to constrain the
evaluation of the displacements of descendant features [Glaz83b].
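
The three-position constraint can be illustrated with a short sketch. It assumes that resolution doubles between levels, so a parent displacement of d pixels along the flow path corresponds to 2d pixels at the next level, give or take one pixel. The function names and the match error callback are hypothetical.

```python
def candidate_displacements(parent_disp, scale=2):
    """Displacements along the flow path to test at the finer level."""
    center = scale * parent_disp
    return [center - 1, center, center + 1]

def refine(parent_disp, match_error):
    """Pick the candidate with the smallest match error at the finer level."""
    return min(candidate_displacements(parent_disp), key=match_error)

# Toy match error whose best displacement at the finer level is 7 pixels.
best = refine(3, lambda d: abs(d - 7))
print(candidate_displacements(3), best)
```

Only three evaluations per feature are needed at each level, regardless of how large the displacement is at full resolution.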


Figure  28.  Relations  between  displacements  at  different  resolutions. 


There is a wide range of possibilities for relating the error function minimization
across the different image resolutions. One strategy that has been employed
involves the use of different step sizes in the error function evaluation correlated with
particular image levels. That is, as processing moves to higher image resolutions,
the step size of the error function evaluation decreases. Alternatively, a complete
search could be done at a given level before proceeding to the higher resolutions.
Feature size can also change as processing goes down the cone since at higher levels
a given window size corresponds to an increased area with respect to the image.
At a high level of resolution, features described by small image areas may not be
distinctive enough to match well.

In  the  experiments  in  figures  29a-d  processing  was  initialized  at  level  4  by  per¬ 
forming  the  global  sampling  of  the  error  measure  at  the  same  density  as  the  exper¬ 
iments  in  chapter  IV  (a  separation  of  ^  radians  in  the  coordinate  system  for  the 
direction  of  translation  sphere).  The  resulting  flow  field  is  shown  in  figure  29a.  The 
first  step  of  the  local  processing  was  initialized  at  the  minimum  determined  in  the 
global  sampling  and  used  a  stepsize  equal  to  0.1  radians  for  the  images  and  features 
at  level  5.  The  resulting  flow  field  is  shown  in  figure  29b.  At  level  6,  the  stepsize 
was  reduced  to  0.025  and  the  local  search  initialized  at  the  minimum  determined 
by  the  processing  done  at  level  5.  At  level  7,  the  stepsize  was  reduced  to  0.005  and 
the  search  was  initialized  at  the  minimum  determined  at  level  6.  5x5  windows  were 
used  at  each  level.  The  procedure  converged  to  the  same  results  as  in  experiment 
three  in  chapter  IV. 
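
The schedule of step sizes described above can be sketched as a coarse-to-fine local search. The neighbor-descent search, the two-angle parameterization, and the toy error function are assumptions for illustration only; only the level and step size pairs come from the text.

```python
def local_search(error, start, step, max_iter=1000):
    """Move to the best of the 8 neighbors at this step until none improves."""
    best, best_err = start, error(start)
    for _ in range(max_iter):
        neighbors = [(best[0] + i * step, best[1] + j * step)
                     for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i, j) != (0, 0)]
        cand = min(neighbors, key=error)
        cand_err = error(cand)
        if cand_err >= best_err:
            break
        best, best_err = cand, cand_err
    return best

# (cone level, step size in radians) as reported for levels 5, 6 and 7.
SCHEDULE = [(5, 0.1), (6, 0.025), (7, 0.005)]

def coarse_to_fine(error_at_level, start):
    """Run the local search level by level, seeding each with the last result."""
    estimate = start
    for level, step in SCHEDULE:
        estimate = local_search(lambda p: error_at_level(level, p), estimate, step)
    return estimate

# Toy error with its minimum at (1.0, 0.5), identical at every level.
toy = lambda level, p: (p[0] - 1.0) ** 2 + (p[1] - 0.5) ** 2
result = coarse_to_fine(toy, (0.6, 0.9))
print(result)
```

In practice the error at each level would be the translational error measure evaluated on that level's images and features, and the starting point would come from the global sampling at level 4.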

Figure 29a. Image displacements at 16 x 16 resolution


Figure  29b.  Image  displacements  at  32  x  32  resolution 


Some  Problems 


A reasonable change to the procedure described here would be the use of
bandpass-filtered images instead of the smoothed ones used here. Work by Burt
[Burt82] and Glazer et al. [Glaz83b] indicates that the matches of features from
successive bandpassed images are much more distinctive than those of features from
low-pass images. Another important question which has not been addressed in any
detail concerns the image level at which to begin processing. One criterion could
be the level at which significant changes in image values occur as determined by
an average difference value. Another could be the response of the error function.
This would involve determining the level at which the error function has a distinct
minimum.

A  particular  problem  in  hierarchical  matching  schemes  occurs  at  occlusion 
boundaries.  Here,  features  on  different  sides  of  an  occlusion  boundary  can  have 
a  common  ancestor,  but  will  themselves  have  different  displacements.  Therefore, 
the  displacement  value  inherited  from  the  parent  may  be  incorrect  for  one  of  the 
features and that feature should have its potential displacements re-evaluated along
its displacement path. A possible criterion to determine the need for re-evaluation
of  the  displacements  of  a  feature  is  if  its  match  value  is  ever  less  than  some  threshold 
or  is  less  than  the  match  strength  of  its  parent.  It  may  be  sufficient  simply  to  not 
evaluate  such  features  if  they  are  found,  and  to  then  determine  their  displacements 
or  occlusion  after  the  more  certain  image  displacements  have  been  found  for  other 
image  points. 

Translational Blur Path Extraction

Blur  streaks  are  commonly  produced  when  the  shutter  mechanism  of  a  camera 
remains  open  while  the  camera  is  moving  relative  to  a  textured  surface.  The  streaks 
are  produced  by  the  successive  positions  of  the  image  projections  of  the  texture 
elements.  Recent  work  [Harr80,  Shep83]  indicates  that  blur  streaks  may  be  a  very 
common  motion  effect  in  the  human  visual  system. 

For  translational  camera  motion,  the  blur  streaks  will  correspond  to  the  image 
displacement  paths:  straight  line  segments  radiating  from  a  common  intersection 
point.  In  the  analysis  of  translational  blur  paths,  some  information  is  lost  concerning 
the  direction  (from  an  FOE  or  towards  an  FOC)  and  magnitude  of  the  displacements 
of  image  points  over  time.  Nonetheless,  the  techniques  developed  in  chapter  IV 
can  be  easily  modified  for  the  extraction  of  translational  blur  paths.  First,  it  is 
necessary  to  compute  the  gradient  of  the  blurred  image.  The  image  gradient  will  be 
perpendicular  to  the  translational  blur  paths  at  positions  where  image  blur  occurs. 
Thus,  the  error  measure  can  be  expressed  as 

$$\sum_{i=1}^{N} \left| \cos \theta_i \right| \qquad (15)$$

where $i$ is an index over image positions, and $\theta_i$ is the angle between the
image gradient at point $i$ and the translational displacement path corresponding to
a  particular  translational  axis.  The  same  evaluation  techniques  can  be  used  for 
this  error  function  as  above,  except  that  there  is  no  need  to  distinguish  between 
FOEs  and  FOCs.  Thus,  the  evaluation  of  the  error  measure  need  only  occur  on  a 
hemisphere.  It  should  be  noted  that  a  variant  of  this  error  measure  can  be  used 
for  processing  translational  motion  sequences  for  which  image  displacements  have 
been determined. In this case, the image displacement vectors will lie along (not
perpendicular to) the correct translational displacement paths. The corresponding
error measure becomes $1.0 - |\cos \theta_i|$.
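
The blur path error measure of equation (15) can be sketched as follows. For simplicity the sketch assumes that the candidate translational axis is given by its intersection point with the image plane (the FOE or FOC) in image coordinates, and that gradients are supplied as an array of 2-D vectors. These conventions and the function name are illustrative assumptions.

```python
import numpy as np

def blur_path_error(points, gradients, foe):
    """Sum of |cos(theta_i)| over image positions.

    theta_i is the angle between the gradient at point i and the radial
    displacement path through point i for the candidate intersection point.
    """
    radial = points - foe                      # direction of each path
    dots = np.sum(radial * gradients, axis=1)
    norms = np.linalg.norm(radial, axis=1) * np.linalg.norm(gradients, axis=1)
    valid = norms > 1e-12                      # skip zero gradients and the FOE itself
    return float(np.sum(np.abs(dots[valid] / norms[valid])))

# Synthetic check: gradients exactly perpendicular to paths from a known point.
rng = np.random.default_rng(1)
points = rng.uniform(0, 100, size=(50, 2))
true_foe = np.array([10.0, -5.0])
r = points - true_foe
gradients = np.stack([-r[:, 1], r[:, 0]], axis=1)
e_true = blur_path_error(points, gradients, true_foe)
e_other = blur_path_error(points, gradients, np.array([80.0, 60.0]))
print(e_true, e_other)
```

Because only the absolute value of the cosine is used, the measure does not distinguish an FOE from an FOC, which is consistent with evaluating it on a hemisphere only.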

The  results  of  a  preliminary  experiment  are  shown  in  Figures  30-33.  Figure 
30  shows  an  image  taken  from  a  car  traveling  down  a  straight  road.  The  shutter 
was  kept  open  for  a  prolonged  exposure  and  blur  streaks  resulted  from  the  texture 
elements in the nearby tree. Figures 31a-c show the gradient magnitude of the image
and its normalized row and column components. Figures 32a-b show intensity and
contour plots of the error function at points on the direction of translation sphere
roughly corresponding to the potential positions of FOEs. Darker corresponds to
less error in the intensity plot. In the contour plot, a "•" is used to indicate the
position of a local minimum of the error function and a "+" is used to indicate the
position of a local maximum. The error function is unimodal due to wrap around on
the  direction  of  translation  sphere  because  the  FOEs  and  FOCs  along  a  particular 
line  of  translation  are  not  distinguished.  Figure  33  shows  the  set  of  translational 
blur  paths  that  were  determined. 

Figure 33. Determined translational blur paths.


It may be useful to use multiple versions of the same image sequence, each
formed using a different exposure time. Those formed with short exposure times
would have very little blurring and their gradients would correspond to static edges.
By subtracting the images formed with very short exposure times from those formed
during the same interval but with longer exposure times, it may be possible to
suppress edges in the blurred images which are not blur related. Of more general
importance in such a representation is the potential ability to relate blur streaks to
the displacements of features extracted from the static images.

The extraction of translational blur paths is also similar to the extraction of
vanishing points and lines from static images. The same procedure can be applied,
without the initial extraction of edges: the determination of edges can occur
concurrently with the extraction of the vanishing point. However, vanishing point
analysis is typically more difficult because only small portions of the image are
radially organized with respect to the potential vanishing points. Determining these
areas, or finding a way to keep the 'noise' from the rest of the image from dominating
the analysis, is the key difficulty. In this case, the error measure may need to be
extended to incorporate information concerning edge length or connectedness along
the radial paths determined by a particular vanishing point.


The  procedure  developed  here  assumes  a  sensor  moving  relative  to  a  stationary 
environment,  or  a  single  object  moving  relative  to  a  stationary  sensor.  A  useful 
extension  would  allow  the  presence  of  multiple,  independently  moving  objects,  while 
maintaining  the  ability  to  determine  image  displacements  concurrently  with  the 
direction  of  translation.  There  are  at  least  three  techniques  which  could  make 
this possible. One is to utilize generalized Hough transform techniques [Ball81,
O'Rou81] for decomposing the responses in an error measure into the corresponding
image  structures  or  segments.  The  other  two  constrain  the  analysis  to  independent 
limited  image  areas  over  which  the  procedure  can  successfully  function. 

We  begin  by  noting  that  the  global  component  of  the  optimization  process  used 
in  chapter  IV  is  an  instance  of  a  generalized  Hough  transform  in  which  each  feature 
scales  its  vote  against  a  particular  translational  axis  as  a  function  of  the  best  match 
it  can  find  that  is  consistent  with  the  translational  axis.  With  only  a  minor  change, 
instead  of  using  an  error  measure,  we  could  use  an  optimization  measure  by  which 
each  feature  scales  its  vote  for  a  particular  translational  axis  by  the  extent  of  the 
best match it can find that is consistent with the axis. The problem then becomes
a  typical  one  for  Hough  transforms:  how  to  associate  labels  corresponding  to  the 
resulting  peaks  in  the  histogram  with  image  points  or  features.  The  general  form 
of  this  processing  is  to  find  the  translational  axis  with  the  greatest  response  in 
the  histogram,  associate  a  label  with  it,  and  then  associate  this  label  with  image 
features  which  match  above  some  threshold  along  the  image  displacement  paths 
determined by the corresponding translational axis. The resulting set of features
is then removed and a new histogram is produced. The peak in this new histogram
is extracted and the process is repeated until there are no more distinct peaks in
the resulting histograms, or all image features are labeled [Adiv83].
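
The iterative labeling described above can be sketched as follows. The sketch assumes that the match strengths of every feature for every candidate translational axis have already been computed and stored in a matrix. The stopping rule (a minimum number of supporting features), the function name, and the example values are assumptions, since the text does not define what makes a peak distinct.

```python
import numpy as np

def label_by_peaks(match, threshold, min_votes=2):
    """Label features by repeated peak extraction.

    match[f, a] is the best match strength of feature f along the image
    displacement paths of candidate axis a. Returns one axis index per
    feature, or -1 for features left unlabeled.
    """
    n_features, _ = match.shape
    labels = np.full(n_features, -1)
    remaining = np.ones(n_features, dtype=bool)
    while remaining.any():
        # Each remaining feature votes for every axis, scaled by its match.
        histogram = match[remaining].sum(axis=0)
        peak = int(np.argmax(histogram))
        members = remaining & (match[:, peak] >= threshold)
        if members.sum() < min_votes:   # assumed test for "no distinct peak"
            break
        labels[members] = peak
        remaining &= ~members           # remove labeled features, rehistogram
    return labels

match = np.array([[0.9, 0.1, 0.1],
                  [0.9, 0.1, 0.1],
                  [0.9, 0.1, 0.1],
                  [0.1, 0.1, 0.9],
                  [0.1, 0.1, 0.9]])
labels = label_by_peaks(match, threshold=0.5)
print(labels)
```

Features that match well along several axes would be claimed by whichever peak is extracted first, which is the difficulty discussed in the following paragraph.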

This  procedure  will  have  difficulties  with  weak  or  homogeneous  feature  points 
which  have  strong  matches  consistent  with  several  distinct  translational  axes.  Thus, 
when rehistogramming occurs it is necessary to establish which image features
already labeled are consistent with the newly extracted peak. An alternative is to
proceed in the conventional manner and determine a set of labels corresponding to
translational  axes  for  which  there  is  evidence.  Each  feature  is  then  labeled  with 
each  translational  axis  from  this  set  with  which  it  is  consistent.  Note  that  a  given 
feature  could  have  several  labels.  A  unique  consistent  labeling  is  then  obtained 
by  using  other  information:  segmentation-grouping  using  other  image  attributes, 
depth  consistency  with  neighbors,  and  common  magnitude  of  image  displacements. 
Additionally,  this  disambiguation  can  occur  over  several  successive  images.  In  fact, 
a  potentially  significant  aspect  of  generalized  Hough  techniques  may  be  the  correla¬ 
tion  of  histograms  from  successive  instants  to  bring  out  structures  that  are  moving 
consistently. 

Two basic questions have to be addressed in this use of Hough techniques: what
is the required density of translational axes in the transform, and what is the minimal
match threshold? In general, the higher the density, the better.


An  alternative  approach  is  to  break  the  image  into  subparts  and  then  locally 
apply  the  procedure  to  associate  a  translational  axis  with  each  subpart.  In  one 
scheme,  this  would  be  done  using  regular  image  areas  (as  in  a  grid)  at  multiple 
levels  of  resolution.  Techniques  similar  to  this  are  used  in  chapter  seven  to  deter¬ 
mine  the  local  directions  of  environmental  motion.  In  another  scheme,  the  subparts 
are  determined  by  some  segmentation  procedure,  and  the  translational  axis  is  de¬ 
termined  from  image  features  within  or  lying  along  the  boundary  of  the  extracted 
segments.  Segments  for  which  the  error  function  response  is  indistinct  are  reseg¬ 
mented  or  their  features  are  associated  with  the  translational  axes  determined  for 
adjacent  image  subparts. 


Hybrid  Sensor  Systems 

Translational processing is sufficient for vision-based navigation in a stationary
environment if the orientation of the optic sensor can be fixed relative to the
environment over time. In this case, sensor motion amounts to a sequence of
translations in possibly different directions over time. There has been much recent work
on sensor stabilization, notably by researchers at McDonnell Douglas Aerospace
Corporation in suspending electro-optical systems in a magnetic field, and elsewhere
using more conventional gimbal-based stabilization.

A  difficulty  with  such  a  stabilized  retina  is  that  it  is  not  able  to  rotate  to  focus 
on  particular  parts  of  the  environment.  This  can  be  corrected  by  using  a  set  of 
such  stabilized  retinas  arranged  to  yield  a  complete  view  of  space.  There  would 
then  be  no  need  to  rotate  the  sensor  to  view  a  particular  environmental  point.  A 
possible  arrangement  of  retinal  surfaces  is  a  cubical  one.  One  of  the  retinal  planes 
will always contain an FOE and another will always contain an FOC (unless the
direction  of  motion  is  right  on  an  edge  of  the  cube  and  the  focal  length  has  not 
been  properly  adjusted).  There  will  also  be  several  independent  estimates  of  the 
direction  of  translation  which  can  be  integrated.  Figure  34  shows  such  a  proposed 
arrangement  of  optic  sensors  attached  to  a  Cartesian  robot  manipulator  so  such  a 
complete,  stabilized  view  of  a  workspace  is  produced  at  all  times. 
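
As a rough illustration of why one cube face always sees the FOE and the opposite face the FOC, the sketch below picks the two faces pierced by a translation direction. The function name, the face labels, and the convention that the FOE lies on the face the sensor is moving toward are assumptions made for this example; they are not part of the proposed hardware.

```python
def cube_faces_for_translation(direction):
    """Return (foe_face, foc_face) for a translation direction (x, y, z).

    Faces are labelled by the outward axis of an axis-aligned cube of
    retinal planes centred on the sensor (a hypothetical labelling).
    """
    components = dict(zip("XYZ", direction))
    # The translational axis pierces the face whose axis has the largest
    # absolute component. A tie means the axis passes through an edge or
    # corner, which is the degenerate case noted in the text.
    axis = max(components, key=lambda name: abs(components[name]))
    if components[axis] == 0:
        raise ValueError("direction must be non-zero")
    sign = "+" if components[axis] > 0 else "-"
    opposite = "-" if sign == "+" else "+"
    return sign + axis, opposite + axis
```

For example, a direction of (0.2, -0.9, 0.3) places the FOE on the -Y face and the FOC on the +Y face.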




Figure  34.  Cartesian  Manipulator  with  attached  optic  devices. 


Alternatively,  if  the  sensor  cannot  be  stabilized,  there  are  other  devices  which  can 
at  least  determine  the  rotational  parameters  of  sensor  motion.  The  rotational  ef¬ 
fects  can  then  be  removed  from  successive  images,  reducing  them  to  translational 
sequences  which  can  be  processed  by  the  techniques  here.  A  particular  technology 
which is very attractive for this use is that of fiber optic rotation sensors [Ezek82]
(figure 35). These sensors are expected to be the low-cost gyroscope of the near
future since they are small, cheap, and precise. Because they have no moving
elements, they are not as affected by rapid accelerations as conventional gyroscopes.
There  are  currently  slow  drift  problems  when  sensor  orientation  is  considered  over 
long  periods  of  time.  In  our  processing  though,  we  would  be  concerned  with  mea¬ 
surements  of  rotation  over  much  shorter  periods.  Additionally,  when  such  sensors 
are  coupled  with  an  image  processing  system  for  guidance  and  navigation,  the  ef¬ 
fects  of  such  long  term  drifts  could  be  recognized  and  accounted  for  by  noting  the 
position  of  specified  landmarks. 


Figure  35.  Layout  of  Fiber  Optic  Rotation  Sensor  (from  [Ezek82]). 


CHAPTER  VI 


PROCESSING  RESTRICTED  SENSOR  MOTION 


Introduction 


The  techniques  used  for  translational  motion  can  also  be  applied  to  other  cases 
of  restricted  motion.  The  issue  is  the  computational  feasibility  of  a  search  through 
a  subspace  of  sensor  motion  parameters  for  values  that  are  consistent  with  image 
feature  displacements.  In  this  chapter  we  briefly  consider  two  such  cases,  pure 
sensor  rotation  and  motion  constrained  to  a  known  plane. 

Processing  Pure  Sensor  Rotation 

For  processing  pure  sensor  rotation,  the  error  measure  can  again  be  defined  with 
respect  to  a  unit  sphere  with  each  point  corresponding  to  an  axis  and  a  direction 
of rotation. We use the (φ1, φ2) coordinate system from chapter IV for referring
to  these  positions.  In  addition  to  these  two  parameters  for  specifying  an  axis  of 
rotation,  there  is  a  third  corresponding  to  the  extent  of  rotation.  The  extent  of 
rotation  is  defined  relative  to  the  orientation  of  a  given  axis  and  encoded  with 
positive  values  denoting  rotation  in  a  clockwise  direction.  Thus,  on  the  unit  sphere 
the points (x,y,z) and (-x,-y,-z) will lie along the same axis of rotation but
correspond  to  different  directions  of  rotation. 
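
The relationship between antipodal points on the sphere can be checked numerically. The sketch below uses Rodrigues' rotation formula, a standard construction that is not taken from this report, with the right-hand rule as the sign convention; that convention is an assumption and may be the opposite of the clockwise-positive encoding described above.

```python
import math


def rotate(point, axis, angle):
    """Rotate a 3-D point about a unit axis through the origin.

    Uses Rodrigues' formula with right-hand-rule sign convention
    (an assumption; the report encodes clockwise rotation as positive).
    """
    ax, ay, az = axis
    px, py, pz = point
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * px + ay * py + az * pz
    # axis x point
    cross = (ay * pz - az * py, az * px - ax * pz, ax * py - ay * px)
    return tuple(
        p * c + cr * s + a * dot * (1.0 - c)
        for p, cr, a in zip(point, cross, axis)
    )
```

Negating the axis while keeping the extent gives the same result as keeping the axis and negating the extent, which is why (x,y,z) and (-x,-y,-z) share an axis but denote opposite directions of rotation.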

As  in  the  case  of  translation,  we  utilize  the  error  of  matches  of  selected  features 
along  their  respective  image  displacement  paths.  However,  there  are  a  few  basic 
differences  with  the  translational  procedure.  First,  feature  displacements  are  not 
measured in image units, but in the extent of angular displacement about the axis
of  rotation.  Second,  the  displacement  along  the  image  displacement  path  can  cause 
significant  reorientation  and  expansion  in  a  feature,  especially  for  large  rotations. 
For  this  reason,  each  pixel  of  the  feature  array  has  its  position  interpolated  inde¬ 
pendently  (figure  36).  If  motion  is  restricted  to  small  rotations  only,  this  may  not 
be  necessary. 

For  a  rotational  field,  the  extents  of  angular  displacements  for  all  the  features 
must  be  identical.  This  yields  a  constraint  which  can  be  incorporated  into  the  eval¬ 
uation  of  a  particular  set  of  rotational  parameters  in  different  ways.  The  evaluation 
can  be  done  as  in  the  translational  case  where  the  best  match  of  each  feature  along 
its  displacement  path  is  determined  independently  of  the  other  features.  This  re¬ 
sults  in  two  different  error  measures:  one  based  on  the  summed  error  values  of  the 
best  matches  and  the  other  based  on  the  variance  of  the  extent  of  displacements 
corresponding  to  these  matches.  Alternatively,  the  feature  displacement  determina¬ 
tion  can  be  restricted  such  that  they  all  evaluate  the  same  extent  of  displacements 
simultaneously. 
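
The three ways of using the common-extent constraint can be sketched as follows. The data layout is hypothetical: `error_curves[i][k]` is assumed to hold the match error of feature i when displaced by the angular extent `extents[k]` along its rotational displacement path for one candidate axis.

```python
from statistics import pvariance


def rotation_error_measures(error_curves, extents):
    """Return the three error measures for one candidate axis of rotation.

    error_curves[i][k]: match error of feature i at angular extent
    extents[k] (hypothetical layout, one curve per feature).
    """
    # Best match of each feature, found independently of the others.
    best = [min(range(len(c)), key=c.__getitem__) for c in error_curves]
    summed_best_error = sum(c[k] for c, k in zip(error_curves, best))
    extent_variance = pvariance([extents[k] for k in best])
    # Alternative: all features evaluate the same extent simultaneously.
    totals = [sum(c[k] for c in error_curves) for k in range(len(extents))]
    shared = min(range(len(extents)), key=totals.__getitem__)
    return summed_best_error, extent_variance, (extents[shared], totals[shared])
```

The first two values come from independent per-feature matching; the third is the minimum of the jointly evaluated error, which adds the extent as a search dimension.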

We  have  tried  these  three  error  measures  on  a  simple  image  pair  and  found  that 
they  all  give  roughly  the  same  result.  The  variance  of  the  extent  of  displacements 
was  minimized  at  the  correct  value,  but  was  very  jagged  and  rough.  The  summed 
error  values  for  the  best  matches  and  the  direct  3-D  search  were  very  smooth  and 
had  a  distinct  global  minimum  in  a  very  large  neighborhood. 


Figure  36.  Determining  Individual  Pixel  Displacements  of  a  Feature. 

Figures 37a and 37b show successive images formed with the image generation
system MOVIE BYU and are referred to as the House Sequence 1. The motion
was a rotation of 2 degrees (0.035 radians) about the (0, -1, 0) axis. The field of
view was 45 degrees. Image contours for application of the interest operator were
determined by a threshold selection algorithm which produces boundaries with
maximum average contrast [Kohl81]. The resulting contour and the extracted features
are  shown  in  figure  37c.  The  interesting  points  were  extracted  by  finding  the  local 
maxima  in  the  distinctiveness  measure  values  which  were  also  greater  than  a  mini¬ 
mal  threshold.  Both  the  features  and  the  neighborhoods  over  which  local  maxima 
were  determined  were  3x3  pixel  areas.  This  small  neighborhood  size  caused  the 
feature  extraction  process  to  be  sensitive  to  the  notches  along  the  contours  as  can 
be  seen  by  the  number  of  extracted  features  along  the  bush  boundary.  Figure  37d 
shows  the  displacements  determined  for  these  features. 


The  evaluation  of  the  error  measure  based  on  the  extent  of  feature  mismatch 
is presented as in chapter IV using the (φ1, φ2) values in two tables. The first
table  (table  6a)  basically  corresponds  to  those  axes  of  rotation  on  the  positive 
Z  portion  of  the  unit  sphere.  The  second  (table  6b)  basically  corresponds  to 
axes  on  the  negative  Z  portion.  Axes  for  which  Z  is  equal  to  zero  and  Y  is 
positive  are  represented  in  the  first  row  of  the  first  table  while  axes  for  which  Z 
is  equal  to  zero  and  Y  is  negative  are  represented  in  the  first  row  of  the  second 
table.  The  tables  are  shown  as  intensity  plots  in  which  darker  corresponds  to 
less  error  and  also  as  contour  plots  in  figures  38a  and  38b.  There  is  a  distinct 
global minimum at the position corresponding to the (0, -1, 0) axis. Nearly all
the  features  had  displacements  corresponding  to  a  rotation  of  0.035  radians  for  this 
axis.  This  was  also  the  best  axis  and  extent  of  rotation  determined  by  the  local 
search  using  the  extent  of  feature  mismatch  for  features  restricted  to  evaluating  the 
same  displacements  simultaneously. 


           0      1      2      3      4      5      6      7      8      9

    0  6.407  6.966  6.252  5.557  5.241  5.384  5.208  5.252  5.810  6.377
    1  7.004  6.507  5.961  5.472  5.479  5.430  5.577  6.0%   6.391
    2  7.064  6.985  6.606  6.387  6.258  6.285  6.283  6.364  6.441
    3  7.103  7.229  7.418  7.725  7.684  7.359  6.981  6.593  6.404
    4  6.658  7.081  7.457  7.940  8.125  8.006  7.128  6.539  6.298
    5  6.294  6.083  6.008  5.879  6.009  5.870  5.862  5.823  6.057
    6  5.985  5.496  5.109  4.159  3.465  3.775  4.809  5.401  5.676
    7  5.666  5.094  4.189  2.767  1.586  2.219  3.913  4.935  5.492
    8  5.566  4.753  3.762  2.347  0.998  1.966  3.327  4.632  5.444
    9  5.473  4.485  3.647  2.170  0.611  1.944  3.279  4.399  5.386

Table  6a.  House  Sequence  1  Error  Values 

Motion Constrained to a Known Plane


If  motion  is  constrained  to  a  known  plane,  the  translational  axis  must  lie  on  a 
plane  perpendicular  to  the  rotational  axis  which  contains  the  focal  point.  Therefore, 
the  FOE/C  in  the  images  are  restricted  to  lie  along  the  line  determined  by  the 
intersection  of  this  plane  and  the  image  plane.  There  are  two  parameters  to  recover: 
the  extent  of  rotation  about  the  axis  that  is  perpendicular  to  the  plane  at  the  focal 
point,  and  the  position  of  the  translational  axis  in  this  plane.  Both  of  these  are 
expressed as angles: θ1 for the extent of rotation and θ2 for the orientation of the
translational  axis  (figure  39a). 
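
A minimal sketch of this parameterization follows. The function names, the use of a `reference` direction to fix where θ2 = 0 lies, and the sense in which θ2 increases are all assumptions for illustration; the report fixes its own conventions in figure 39a.

```python
import math


def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def planar_translation_axis(normal, reference, theta2):
    """Unit translational axis lying in the plane perpendicular to `normal`.

    `normal` is the rotational axis of the known plane of motion.
    `reference` is projected into the plane and marks theta2 = 0
    (hypothetical convention).
    """
    n = _unit(normal)
    d = sum(r * c for r, c in zip(reference, n))
    u = _unit(tuple(r - d * c for r, c in zip(reference, n)))
    # v = n x u completes an orthonormal basis of the plane.
    v = (n[1] * u[2] - n[2] * u[1],
         n[2] * u[0] - n[0] * u[2],
         n[0] * u[1] - n[1] * u[0])
    return tuple(math.cos(theta2) * a + math.sin(theta2) * b
                 for a, b in zip(u, v))


def foe_in_image(axis, focal_length=1.0):
    """Image position of the FOE/C, or None when it lies at infinity."""
    x, y, z = axis
    if abs(z) < 1e-12:
        return None
    return (focal_length * x / z, focal_length * y / z)
```

With the rotational axis along Y, every value of θ2 yields an FOE/C on the image line y = 0, which is the intersection of the plane of motion with the image plane.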


Figure 39a. θ1, θ2 parameters for describing planar motion.


The  error  measure  for  this  case  combines  the  computation  for  rotation  and 
translation. For the rotation and translation corresponding to particular (θ1, θ2)
values, a feature is first positioned along its rotational displacement path using
bilinear interpolation for each pixel and then displaced along the translational
displacement path at equal increments to determine its best match. As in translational
processing, the interpolation for individual pixels is not performed for the
translational displacement (figure 39b). The minimal match errors for each feature are
then summed. The error function in this case can be thought of as being mapped
on a cylinder with the θ2 parameter, corresponding to the direction of translation,
wrapping  around. 
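
The bilinear interpolation used to position each pixel of a feature can be sketched as below. This is the standard textbook formulation rather than the implementation used for the experiments; the `image[row][col]` layout and the clamping of positions to the image border are assumptions.

```python
import math


def bilinear_sample(image, x, y):
    """Sample `image` (a list of rows, indexed image[row][col]) at a
    fractional position (x, y), clamping to the image border."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two bracketing rows, then along y.
    top = image[y0][x0] * (1.0 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1.0 - fx) + image[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy
```

Each pixel of a feature would be sampled this way at its rotated position before the feature is slid, without further interpolation, along its translational path.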


Figure 39b. Evaluation of image displacements corresponding to θ1, θ2 values.


Figures 40a and 40b show Grass Sequence 1. The image in figure 40b of
sample grass texture was produced from figure 40a by rotating 0.1 radians about
the (0,0,1) axis and then translating along the (0,1,0) axis. Figure 41a shows 50
points which were selected at random from image positions where contrast exceeded
a minimal value. Figure 41b shows the displacements determined for these points.
Figure 42a shows the resulting error function in terms of θ1 and θ2 coordinates
as an intensity plot. Figure 42b shows the error function as a contour plot with
“-” indicating the local minima and “+” indicating the local maxima. θ1
ranges from -0.15 to 0.15 radians in 0.01 radian increments. θ2 ranges from 0.0
to 2π radians with 0.0 corresponding to the position of the translational axis at
(-1,0,0). The minimum in the error function corresponded to the correct values of
the  rotation  and  translation. 


Figure  42a.  Intensity  plot  of  Error  Measure 


Known Planar Motion with Determined Image Displacements

To  process  known  planar  motion  for  image  sequences  for  which  image  displace¬ 
ments  have  been  computed,  we  use  the  error  measure  based  on  the  properties  of 
composite  image  motions  discussed  in  chapter  III  to  describe  the  consistency  of  a 
given set of image displacements with particular values of θ1 and θ2.

Referring to figure 6b in chapter III, for a given image displacement from image
point I_m to I_n, its consistency with particular values of θ1 and θ2 is determined
by first applying the rotation specified by θ1 to obtain a displacement from I_m to
J_m (figure III.6b). The angle between the vector J_m - I_n and the translational
displacement path line determined by the FOE/C corresponding to θ2 and J_m
reflects the degree of consistency. We actually use one minus the cosine of this
angle.  By  summing  these  values  for  a  set  of  image  displacements,  the  consistency 
of  the  entire  field  is  determined. 
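
The consistency computation can be sketched for one simple case. The code below assumes, purely for illustration, that the known plane of motion makes the rotational axis the camera Y axis, that the residual is measured against the direction pointing away from an FOE, and that the sign of the rotation follows the right-hand rule; none of these choices is stated in the text, and the function names are hypothetical.

```python
import math


def rotate_image_point_about_y(point, theta1, focal_length=1.0):
    """Image position of `point` after rotating its viewing ray by theta1
    about the camera Y axis (assumed axis and sign convention)."""
    x, y = point
    c, s = math.cos(theta1), math.sin(theta1)
    X = x * c + focal_length * s
    Z = -x * s + focal_length * c
    return (focal_length * X / Z, focal_length * y / Z)


def planar_consistency(displacements, theta1, foe, focal_length=1.0):
    """Sum of (1 - cos angle) between each rotation-corrected residual and
    the translational path line through the FOE (lower is more consistent).

    displacements: list of (start, end) image point pairs.
    """
    total = 0.0
    for start, end in displacements:
        jx, jy = rotate_image_point_about_y(start, theta1, focal_length)
        rx, ry = end[0] - jx, end[1] - jy      # residual after rotation
        lx, ly = jx - foe[0], jy - foe[1]      # direction away from the FOE
        norm = math.hypot(rx, ry) * math.hypot(lx, ly)
        if norm == 0.0:
            continue  # no residual or point at the FOE: angle undefined
        total += 1.0 - (rx * lx + ry * ly) / norm
    return total
```

Evaluating this sum over a grid of (θ1, θ2) values, with each θ2 mapped to its FOE/C, gives an error surface of the kind plotted for the raw-image case.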

This  procedure  has  to  be  extended  slightly  to  deal  with  pure  rotations.  In  this 
case,  the  difference  vector  between  the  image  displacement  vector  and  the  correct 
rotational  displacement  vector  will  be  quite  short  and  behave  erratically  with  re¬ 
spect  to  the  determination  of  the  angle  with  the  corresponding  translational  field 
line.  Pure  rotational  fields  have  two  properties  which  we  utilize  to  detect  their  oc¬ 
currence.  First,  when  rotational  fields  having  the  same  axis  but  different  extents  of 
rotation  are  subtracted  from  each  other,  the  variance  of  the  length  of  the  difference 
vectors tends to be small. Second, the correct rotational field will minimize the
average length of these difference vectors. Thus, a purely rotational field is indicated
when  the  variance  of  the  length  of  the  difference  vectors  is  small  with  respect  to 
one  of  the  rotational  fields  generated  by  the  axis  of  rotation  corresponding  to  the 
known  plane  of  motion,  or  the  average  length  of  the  difference  field  is  small.  The 


correct  extent  of  rotation  is  that  which  minimizes  the  total  length  of  the  difference 
vectors. 
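
A sketch of this test is given below. The candidate rotational fields are assumed to have been generated beforehand for the known axis, one per extent of rotation; the thresholds `var_tol` and `mean_tol` are hypothetical parameters, since the text does not give values.

```python
import math
from statistics import mean, pvariance


def difference_lengths(observed, rotational):
    """Lengths of the difference vectors between two displacement fields."""
    return [math.hypot(ox - rx, oy - ry)
            for (ox, oy), (rx, ry) in zip(observed, rotational)]


def is_pure_rotation(observed, fields_by_extent, var_tol, mean_tol):
    """True if, for some candidate rotational field, the difference vectors
    have small length variance or small average length."""
    for field in fields_by_extent.values():
        lengths = difference_lengths(observed, field)
        if pvariance(lengths) < var_tol or mean(lengths) < mean_tol:
            return True
    return False


def best_extent(observed, fields_by_extent):
    """Extent of rotation minimizing the total difference-vector length."""
    return min(
        fields_by_extent,
        key=lambda e: sum(difference_lengths(observed, fields_by_extent[e])),
    )
```

Here `fields_by_extent` maps each candidate extent to a list of displacement vectors sampled at the same image points as `observed`.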


Ambiguities  in  Planar  Motion 

We  have  noted  an  ambiguity  that  occurs  in  the  case  of  motion  constrained  to 
a  known  plane  when  the  focal  length  is  relatively  long  and  the  axis  of  rotation  is 
roughly  parallel  to  the  image  plane.  In  this  case,  the  rotational  component  field 
is  very  similar  to  a  translational  field  with  the  FOE/C  at  infinity  in  the  image 
plane. The extents of the displacements are also nearly identical. The effect of this is to
displace  the  translational  component  by  some  amount  proportional  to  the  direction 
and  extent  of  rotation.  As  a  result,  the  composite  field  looks  like  a  translational  field 
which  could  result  from  a  wide  range  of  translations  and  compensating  rotations 
(figure  43).  The  effect  of  this  on  the  error  measure  is  a  trough  of  low  error  values. 

Figures 44a-b are successive images formed using MOVIE BYU and are referred
to as House Sequence 2. Figure 44a is identical to figure 37a, while figure 44b was
generated by translating along the (0,0,1) axis after the rotation shown in figures
37a and 37b. Figures 45a and 45b show the error measure with θ1 ranging from
-0.05 to 0.05 radians and θ2 ranging from 0.0 to 2π. The trough of low error
values is apparent.


Discussion 


All of the extensions discussed for translational processing (hierarchical
processing, blur path extraction, independently moving objects) should be directly
applicable to the pure rotational case. There are some specific differences,
however. The blur path extraction is more complex in the rotational case because
the image displacement paths are conics instead of straight lines;
the necessary expressions for the tangents to the image displacement paths in the
rotational case were derived in chapter III. While independently moving objects
may not frequently move in trajectories corresponding to rotation about an axis
positioned at the focal point, there is a related phenomenon which may be of some
use  in  decomposing  arbitrary  motion.  The  image  displacements  of  very  distant, 
stationary  objects  or  environmental  features  (like  the  horizon,  the  moon,  the  stars) 
will  primarily  be  a  reflection  of  the  effects  of  the  rotational  sensor  motion.  Thus,  if 
image  features  whose  displacements  are  dominated  by  rotational  motion  could  be 
detected,  the  rotational  parameters  could  be  extracted,  the  image  corrected,  and 
the  translational  parameters  inferred  by  the  procedures  in  chapter  four. 

These  extensions  should  also  be  applicable  to  the  case  of  pure  planar  motion 
though  with  some  complications.  The  blur  paths  are  more  difficult  to  characterize 
in  the  planar  case.  The  error  function  response  also  seems  to  have  large  flat  areas 
which  would  especially  affect  the  processing  of  planar  motion  in  restricted  portions 
of  an  image.  Finally,  the  cases  for  which  planar  motion  is  ambiguous  would  be 
serious  for  any  of  the  discussed  extensions  and  may  require  processing  over  several 
frames. 


CHAPTER VII


THE  LOCAL  TRANSLATIONAL  DECOMPOSITION 


Introduction 


In  this  chapter  we  utilize  the  procedure  for  translational  motion  to  process  im¬ 
age  sequences  produced  by  other  classes  of  restricted  and  arbitrary  sensor  motion. 
This  is  accomplished  via  application  of  the  translational  procedure  to  small  image 
areas.  This  approximates  more  general  motion  as  an  array  of  local  environmental 
translations,  and  interprets  local  image  motions  as  if  they  resulted  from  transla¬ 
tional  motion  of  the  corresponding  portions  of  the  environment.  The  feasibility  of 
this  approach  was  demonstrated  in  chapter  IV  where  the  direction  of  translation 
was  extracted  with  reasonable  precision  from  small  image  areas  containing  a  few 
features.  The  resulting  description  of  motion  is  an  approximation  to  what  we  term 
the  Environmental  Direction  of  Motion  Field  (EDMF)  which  associates  with  a  set 
of  image  points  (or  small  image  areas)  the  relative  direction  of  motion  of  the  cor¬ 
responding  environmental  points  (or  small  environmental  surface  areas).  This  is  a 
low  level  representation  of  environmental  motion  which  considerably  simplifies  the 
recovery  of  the  sensor  motion  parameters. 

This  chapter  consists  of  four  parts.  The  first  considers  computing  the  Environ¬ 
mental  Direction  of  Motion  Field  when  image  displacement  vectors  have  or  have 
not  been  initially  computed.  The  second  section  describes  EDMF  properties  for 
different  cases  of  sensor  motion.  In  the  third  section,  these  properties  of  the  lo¬ 
cal  translational  decomposition  are  used  to  process  image  sequences  produced  by 
sensor  motion  constrained  to  an  unknown  plane  in  textured  environments.  In  the 
fourth section, we develop a set of equations for environmental depth inferences
from  image  displacements  based  upon  an  assumption  of  environmental  rigidity.  We 
then  show  how  these  equations  may  be  solved  using  the  EDMF. 


Computing  the  Environmental  Direction  of  Motion  Field 

The  Environmental  Direction  of  Motion  Field  (EDMF)  is  a  low  level  description 
of  environmental  motion  which  associates  with  each  feature,  or  small  image  area,  a 
three  dimensional  unit  vector  describing  the  direction  of  motion  of  the  correspond¬ 
ing  feature  (or  small  surface  area)  in  the  environment  relative  to  the  observer.  In 
the  continuous  case,  the  EDMF  can  be  thought  of  as  a  description  of  environmental 
motion  where  only  the  orientations  of  tangents  along  the  environmental  displace¬ 
ment  paths  are  known.  We  consider  first  how  to  compute  the  EDMF  and  then  how 
it  can  be  used  to  recover  sensor  motion  parameters  and  environmental  depth. 


Analysis  of  Raw  Image  Sequences 

The  procedure  for  translational  motion  described  in  chapter  IV  yields  a  set  of 
image  displacements  consistent  with  a  determined  translational  axis.  Application 
of  this  procedure  to  a  small  area  of  an  image  containing  extracted  features  will  yield 
a  set  of  image  displacements  consistent  with  an  interpretation  of  the  local  image 
motion  as  a  relative  translation  of  that  corresponding  part  of  the  environment.  Note 
that  where  the  translational  approximation  is  poor  there  will  be  a  large  value  of  the 
error  measure  reflecting  the  weaker  confidence  in  the  validity  of  the  approximation. 
It  is  also  necessary  to  incorporate  information  concerning  the  number  and  distribu¬ 
tion  of  the  feature  points  in  the  local  image  areas  for  this  evaluation.  For  example, 


if  there  is  only  one  feature  in  a  small  area  or  the  features  are  bunched  together,  then 
the  translational  approximation  would  be  suspect.  The  further  processing  of  the 
EDMF  should  not  utilize  local  areas  which  do  not  have  satisfactory  characteristics. 

This  use  of  the  translational  procedure  can  be  seen  as  a  local  constraint  on  the 
determination  of  image  displacements.  Typically,  most  such  constraints  are  based 
upon  smoothness  of  the  resulting  displacement  field  [Barn80,  Glaz81,  Horn80], 
where  image  displacements  are  computed  under  the  constraint  of  being  a  local 
average  of  the  displacements  in  their  surrounding  neighborhood.  In  our  case,  image 
displacements  are  determined  such  that  the  corresponding  environmental  motion 
can  be  interpreted  locally  as  being  translational.  Note  that  this  constraint  does  not 
necessarily  imply  local  smoothness  in  the  displacement  field. 

Computing  the  EDMF  from  raw  image  sequences  depends  upon  how  the  images 
are  divided  into  subareas.  The  image  could  be  divided  into  small,  regular,  square 
subareas  across  the  image  and  the  procedure  for  determining  the  axis  of  translation 
is  applied  to  each  subarea  independently.  Alternatively,  the  procedure  could  be 
applied  to  individual  regions  determined  by  some  segmentation  procedure.  In  our 
work  to  date,  we  have  used  another  approach  in  which  the  image  subareas  are  neigh¬ 
borhoods  centered  on  single  features  and  the  computation  is  applied  independently 
over  the  neighborhood  of  each  feature. 

Computing  the  EDMF  can  be  expensive  for  such  feature-based  neighborhoods 
since  the  feature  displacements  of  many  points  are  being  determined  simultaneously 
for  different,  overlapping,  image  subareas.  An  approximation  is  used  to  simplify 
this  computation.  For  each  feature,  its  best  match  and  corresponding  displacement 
along  each  of  a  set  of  radial  directions  are  determined  from  one  image  into  the  next. 
These values are then stored in a 1-D array where each index corresponds to a
particular radial direction centered at the feature and the associated best match
value for the corresponding direction (figure 46).

Figure 46. Approximating Match Values Along Translational Flow Paths.

This set of values is then used for all the translational computations employing
this feature in its various neighborhoods. To determine the value of a particular
translational axis with respect to the neighborhood of a feature, each feature in
the  neighborhood  finds  its  best  match  along  the  direction  closest  to  that  determined 
by  the  translational  axis  and  the  resulting  values  are  then  summed  up.  In  this  way, 
redundant  evaluations  of  feature  matches  are  avoided. 
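
The lookup described above can be sketched as follows. The array `radial_errors[i]` is assumed to hold the precomputed best-match error of feature i for each of `n_dirs` evenly spaced directions, with index 0 along the positive image x axis; that layout, and the choice of the direction pointing away from the FOE, are assumptions for illustration.

```python
import math


def direction_index(dx, dy, n_dirs):
    """Index of the stored radial direction closest to (dx, dy)."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    step = 2.0 * math.pi / n_dirs
    return int(round(angle / step)) % n_dirs


def neighborhood_error(features, radial_errors, foe, n_dirs):
    """Sum, over the features of one neighborhood, of the cached best-match
    error along the direction implied by a candidate FOE.

    features: list of (x, y) positions; radial_errors[i][k]: cached error
    of feature i along direction k (hypothetical layout).
    """
    total = 0.0
    for (x, y), errors in zip(features, radial_errors):
        k = direction_index(x - foe[0], y - foe[1], n_dirs)
        total += errors[k]
    return total
```

Because the per-feature arrays are filled once, evaluating another candidate axis or another overlapping neighborhood costs only table lookups and a sum.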

Figures  47a-b  are  referred  to  as  the  Grass  Sequence  2.  Figure  47a  is  a  128x128 
pixel  image  of  some  grass  texture  with  seven  bits  of  intensity.  Figure  47b  was 
derived  from  figure  47a  by  applying  a  rotation  of  0.1  radians  about  the  Y  axis  of 
the camera coordinate system described in chapter III. The focal length was set
to  one  and  bilinear  interpolation  was  used.  Features  were  selected  from  the  image 
in  figure  47a  by  determining  image  points  where  the  contrast  was  greater  than 
20  intensity  levels  and  which  were  also  local  maxima  in  the  distinctiveness  values 
associated  with  5x5  pixel  square  features  centered  at  those  points.  The  resulting 
feature  positions  are  shown  in  figure  48. 

The  direction  of  translation  was  determined  for  11x11  pixel  neighborhoods  cen¬ 
tered  at  each  feature  in  figure  48.  Each  feature  determined  its  best  displacements 
in  256  evenly  spaced  directions  for  distances  of  up  to  10  pixels.  The  image  dis¬ 
placement  associated  with  a  feature  was  the  displacement  that  was  consistent  with 
the  FOE/C  determined  by  the  translational  approximation  for  the  feature’s  neigh¬ 
borhood.  The  resulting  image  displacement  field  is  shown  in  figure  49.  As  can  be 
seen  from  the  discussion  in  chapter  III,  it  has  the  correct  form  for  rotational  motion 
about  the  Y  -axis. 

Figures 50a-c show the (X, Y, Z) components of the EDMF for the corresponding
image points. The values in the EDMF are between -1.0 and 1.0 since it consists of
unit vectors. Note that all the features have displacements in the same X direction
(figure 50a) because the camera rotation about Y induces all points to move left
or right. The Y displacements were all very close to zero (consistent with motion
constrained to planes perpendicular to the Y-axis). The mean Y displacement was -0.003
(figure 50b). The Z components are positive for the right half and negative for the
left half of the image (figure 50c; the scale of the display has also been increased).
This  motion  occurs  in  pure  rotation  about  Y  because  the  environmental  motions 
lie  on  circular  paths  with  one  side  going  away  from  the  observer  and  the  other  side 
going  towards  the  observer. 
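
This sign pattern can be reproduced with a small sketch. It uses the instantaneous relative velocity -(w x P) of a stationary point P seen from a camera rotating with w along +Y as a stand-in for the direction of a small finite displacement; the choice of rotation sense is an assumption picked so that the signs match the figures described above.

```python
import math


def edmf_vector_for_y_rotation(point):
    """Unit direction of relative environmental motion at `point` (X, Y, Z)
    for a camera rotating about its +Y axis (assumed rotation sense)."""
    X, Y, Z = point
    # -(w x P) with w = (0, 1, 0) is (-Z, 0, X): no Y component at all.
    vx, vz = -Z, X
    norm = math.hypot(vx, vz)
    return (vx / norm, 0.0, vz / norm)
```

For points in front of the camera (Z > 0) the X component has the same sign everywhere, the Y component is zero, and the Z component changes sign between the left and right halves of the image.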

measure from chapter IV discussed in the section on the processing of translational
blur paths. The error associated with a particular translational axis is a function
of the angles between the image displacement paths determined by the FOE and
the image displacement vectors. The function employed is the sum of one minus
the cosine of each such angle, $\sum_{i=1}^{N} (1.0 - \cos\theta_i)$. To compute the EDMF, the
translational  axis  is  determined  by  applying  this  error  measure,  minimized  as  in 
chapter  IV,  to  local  areas  of  a  computed  displacement  field. 

Figure  51  shows  a  32x32  image  displacement  field  produced  using  a  spherical 
distribution  of  environmental  points  about  the  Z  -axis.  The  observer  is  looking 
into  the  interior  of  a  sphere  with  noise  modulation  added  to  the  depth  values  of 
the  points  in  this  figure.  This  noisy  sphere  was  rotated  0.1  radians  about  an  axis 
tangent  to  a  point  on  the  back  of  it  along  the  (1,1,1)  axis.  Note  that  this  field  was 
generated  by  an  axis  of  rotation  that  was  not  positioned  at  the  origin  of  the  camera 
coordinate  system.  Each  image  point  was  the  center  of  a  5x5  neighborhood  over 
which  the  translational  procedure,  using  the  adapted  error  measure,  was  applied. 
Figures 52a-c show the X, Y, Z components of the computed EDMF and the
correct EDMF, encoded as intensity with -1 being darkest, 1 the brightest, and
0 the neutral gray intensity along the border. Figure 53 shows the values of
the  error  of  the  translational  approximation.  Note  how  the  approximation  is  poor 
where  the  field  has  a  rotational  character  with  vectors  at  very  different  orientations 
in  a  small  area. 

Figure 51. Simulated Flow Field.


Figure 52c. Computed Y Component of the EDMF.


Figure 52d. Correct Y Component of the EDMF.


Computing the EDMF From Sparse Flow Fields

It  may  be  possible  to  compute  the  EDMF  from  sparse  displacement  fields  by 
applying  an  interpolation  process  [Glaz83b,  Grim81,  Terz82,  Terz83]  to  produce 
a  field  of  adequate  density  and  then  applying  the  techniques  above.  Some  initial 
experiments  have  been  performed  to  test  this  possibility,  and  they  have  shown  a 
correlation  between  field  density  and  the  reliability  of  the  approximation.  The 
primary  difficulty  with  very  sparse  fields  is  that  the  interpolation  processes  produce 
large  areas  of  parallel  displacements  about  the  given  image  displacement  vectors 
upon which the interpolation is based. The resulting flow field can be very different
from the actual flow field from which the points were sampled, and therefore results
in a poor approximation to the actual EDMF.

EDMF  Properties  for  Different  Cases  of  Motion 

To  describe  EDMF  properties  for  different  cases  of  motion,  it  is  useful  to  map 
all  the  EDMF  vectors  onto  the  direction  of  translation  sphere.  In  Chapter  IV, 
the  direction  of  translation  sphere  was  used  as  the  domain  of  the  error  measure. 
Here  it  is  used  in  a  manner  similar  to  a  histogram.  Each  EDMF  vector  votes  for 
a  particular  point  on  the  direction  of  translation  sphere.  Processing  then  involves 
finding  certain  patterns  in  the  distribution  of  the  EDMF  vectors. 


EDMF  Properties  of  Pure  Translational  Motion 


As discussed previously, the image displacement paths for translational motion
are straight lines intersecting at a point. The environmental displacement paths are
straight,  parallel  lines.  All  the  vectors  in  the  EDMF  are  identical  and  map  onto  a 
single  point  on  the  direction  of  translation  sphere  corresponding  to  the  translational 
axis. 


EDMF  Properties  of  Pure  Rotational  Motion 

For  pure  rotational  motion  of  the  camera,  the  image  displacement  paths  are 
conic  sections  determined  by  the  intersection  of  the  image  plane  with  the  nested 
family  of  cones  aligned  with  the  axis  of  rotation  based  at  the  origin  of  the  camera 
coordinate  system.  The  environmental  displacement  paths  are  circles  about  the 
axis  of  rotation  and  are  contained  in  planes  perpendicular  to  it.  When  mapped 
onto  the  direction  of  translation  sphere,  the  EDMF  vectors  will  lie  upon  a  great 
circle  contained  in  a  plane  perpendicular  to  the  axis  of  rotation. 


EDMF  Properties  of  Motion  Constrained  to  an  Unknown  Plane 

For this case, the environmental displacement paths are circles in planes perpendicular
to the axis of rotation, but the axis does not necessarily contain the
origin  of  the  coordinate  system  (see  the  discussion  of  kinematics  in  chapter  1  of 
[Whit44]).  As  for  the  rotational  case,  the  EDMF  vectors  will  lie  on  a  great  circle 
in  a  plane  perpendicular  to  the  axis  of  rotation  when  mapped  onto  the  direction  of 
translation  sphere. 


EDMF Properties of Arbitrary Motion

For  arbitrary  motion,  the  image  displacement  paths  cannot  be  easily  described. 
However,  the  environmental  displacement  paths  are  helices  about  an  axis  which  does 
not  necessarily  contain  the  origin  (since  a  screw  displacement  is  the  most  general 
form  of  a  rigid  body  motion  [Coxe61,  Whit44]). 

The  set  of  normalized  tangent  vectors  to  a  helix,  when  based  at  a  common 
origin,  will  generate  a  cone  which  we  term  the  tangent  cone.  The  orientation  of 
this  cone  specifies  the  axis  of  rotation.  The  set  of  tangent  cones  determined  by  a 
rigid  body  motion  for  all  points  in  space  will  all  have  the  same  orientation.  Note 
that  the  difference  vectors  between  any  vectors  of  a  tangent  cone  will  lie  in  a  plane 
perpendicular  to  the  axis  of  rotation.  Thus,  the  EDMF  produced  during  arbitrary 
motion  has  a  particularly  nice  property  if  the  rigid  body  motion  is  constant  over 
two or more intervals. For such motion there will be successive environmental direction
of motion vectors associated with each image point, and the difference vectors
between  these  successive  EDMF  vectors  will  lie  in  the  same  plane,  perpendicular  to 
the  axis  of  rotation,  for  all  image  points. 
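
The tangent cone property can be checked numerically. The sketch below assumes, purely for convenience, a helix about the Z axis with a hypothetical radius and pitch; it computes unit tangents at several rotation angles and verifies that their difference vectors have no component along the axis.

```python
import math

def helix_tangent(radius, pitch, angle):
    """Unit tangent of a helix about the Z axis at the given rotation angle."""
    tx, ty, tz = -radius * math.sin(angle), radius * math.cos(angle), pitch
    n = math.sqrt(tx * tx + ty * ty + tz * tz)
    return (tx / n, ty / n, tz / n)

axis = (0.0, 0.0, 1.0)
# Successive tangents along one helix (one tangent cone).
tangents = [helix_tangent(2.0, 0.5, 0.3 * k) for k in range(5)]
diffs = [tuple(b - a for a, b in zip(t0, t1))
         for t0, t1 in zip(tangents, tangents[1:])]
# Component of each difference vector along the axis of rotation.
dots = [sum(d * a for d, a in zip(diff, axis)) for diff in diffs]
```

Each entry of `dots` is zero up to rounding, since all tangents of one helix share the same axial component.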

In  general,  by  mapping  the  EDMF  onto  the  direction  of  translation  sphere, 
the  local  differential  properties  of  the  EDMF  are  not  being  utilized.  Such  things 
as  the  extent  of  rotation  can  be  recovered,  or  at  least  strongly  constrained,  by 
analyzing  the  local  changes  in  the  orientation  of  the  EDMF  vectors  either  spatially 
(over  a  small  area  of  an  image)  or  temporally  (over  successive  inter-image  intervals). 
Consider  the  case  where  the  parameters  of  motion  remain  constant  over  successive 
intervals.  Here,  the  angle  between  the  successive  EDMF  vectors  associated  with 
an  image  point  will  be  equal  to  the  angle  of  rotation.  This  angle  will  be  the  same 
for  all  points  in  the  image  sequence  and  suggests  a  potentially  robust  technique  for 
determining the extent of rotation by finding the mean angle between successive
EDMF  vectors.  For  a  single  EDMF  and  image  displacement  field,  this  technique 
could  be  extended  by  predicting  the  EDMF  vector  for  a  point  in  the  next  interval 
by  interpolating  the  value  in  the  EDMF  at  the  position  determined  by  the  head  of 
the  image  displacement  vectors. 

Processing  of  Motion  Constrained  to  an  Unknown  Plane 

The  EDMF  produced  by  motion  constrained  to  an  unknown  plane  leads  to  a 
particularly  simple  algorithm.  For  this  case  there  is  one  constraint  on  the  inference 
of  sensor  motion  parameters:  the  axis  of  rotation  is  perpendicular  to  the  axis  of 
translation. This corresponds to inferring four independent parameters: the rotational
axis, the extent of rotation and the position of the translational axis in
the plane perpendicular to the axis of rotation. All of the EDMF vectors are constrained
to lie in a plane which is parallel to the plane of environmental motion.
By  calculating  the  EDMF  vectors  and  fitting  a  plane  to  them,  the  plane  of  motion 
and  thus  the  axis  of  rotation  can  be  recovered.  If  the  motion  occurs  over  several 
successive  instants  and  remains  constrained  to  the  same  plane,  then  the  vectors 
in the successive EDMFs are also constrained to lie in a plane parallel to it and
containing the origin on the direction of translation sphere. Thus, more and more
values for the fit can be collected over time, thereby increasing the accuracy of the
processing. The extent of rotation can then be recovered by techniques for processing
motion restricted to a known plane described in chapter VI. The processing
is  further  simplified  since  the  image  displacements  have  already  been  computed  or 
were  determined  from  computing  the  EDMF. 

The best planar fit to the EDMF vectors can be found using any of a number
of plane fitting routines. In the experiments here, an eigenvector fit procedure (described
in [Duda73] pp. 332-335) is used, having been adapted for planes containing
the  origin.  Once  the  plane  of  motion  is  determined,  the  algorithm  for  processing 
known  planar  motion  from  a  computed  displacement  field  is  used.  We  now  consider 
some  examples. 
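
As a hedged illustration of this kind of fit, the sketch below estimates the normal of the least-squares plane through the origin as the eigenvector of the scatter matrix with the smallest eigenvalue. The power iteration on a shifted matrix is a substitute chosen here to keep the example dependency-free; it is not claimed to be the routine of [Duda73] or the one used in the experiments, and the function name `fit_plane_normal` is hypothetical. The fixed starting vector is assumed not to be orthogonal to the true normal.

```python
import math

def fit_plane_normal(vectors, iterations=500):
    """Estimate the unit normal of the best-fit plane through the origin.

    The normal is the eigenvector of S = sum(v v^T) with the smallest
    eigenvalue; it is the dominant eigenvector of trace(S)*I - S, which is
    found here by power iteration.
    """
    s = [[sum(v[i] * v[j] for v in vectors) for j in range(3)] for i in range(3)]
    trace = s[0][0] + s[1][1] + s[2][2]
    m = [[(trace if i == j else 0.0) - s[i][j] for j in range(3)] for i in range(3)]
    n = [1.0, 1.0, 1.0]   # assumed not orthogonal to the true normal
    for _ in range(iterations):
        n = [sum(m[i][j] * n[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in n))
        n = [c / norm for c in n]
    return n

# Synthetic direction vectors lying close to the XZ plane (true normal: Y axis).
samples = [(math.cos(0.1 * k), 0.001 * (-1) ** k, math.sin(0.1 * k))
           for k in range(20)]
normal = fit_plane_normal(samples)
```

Vectors with a large local translational error could be dropped from `samples` before the fit, in the spirit of the thresholding described in the examples.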

The  grass  sequence  2  from  this  chapter  involving  pure  rotation  is  a  case  of 
motion  constrained  to  a  plane  since  the  environmental  displacement  paths  all  lie 
in  planes  perpendicular  to  the  axis  of  rotation.  Using  the  EDMF  determined  for 
the  grass  texture  sequence  described  above,  the  normal  to  the  best  plane  fit  was 
(.003, .999, -.014).  This  is  in  error  by  .015  radians,  or  .836  degrees,  from  the  correct 
rotational  axis. 

Using  all  the  EDMF  vectors  determined  for  the  flow  field  in  figure  51  in  the 
plane  fitting  procedure,  the  normal  to  the  plane  of  motion  is  determined  to  be 
(.647,  .544,  .534).  This  deviates  from  the  correct  axis  by  .089  radians  or  5.078 
degrees. This fit can be improved by removing vectors from the EDMF for which the
corresponding  local  FOE/C  yields  a  large  error,  and  therefore  a  poor  translational 
approximation.  For  the  EDMF  vectors  computed  from  the  flow  field  in  figure 
51,  the  error  value  is  equal  to  the  sum  of  the  angles  between  the  flow  vectors 
in  each  5x5  neighborhood  over  which  the  EDMF  vector  was  determined  and  the 
displacement  paths  corresponding  to  the  translational  axis  which  minimized  the 
error  measure.  We  can  thus  express  the  validity  of  a  computed  EDMF  vector  by  the 
sum  of  these  deviation  angles.  Figure  53  shows  the  error  values  in  the  translational 
fit  proportional  to  image  darkness.  Note  that  the  greatest  errors  occur  where  the 
image  displacement  vectors  have  a  rotational  character.  By  restricting  the  planar 
fit  to  EDMF  vectors  for  which  the  sum  of  the  deviation  angles  corresponds  to  less 
than  some  threshold  (90  degrees  in  this  example)  of  error  relative  to  the  determined 
translational field lines over the 5x5 pixel neighborhoods, the normal is determined
to  be  (.579462,  .583347,  .569148).  This  deviates  by  .010380  radians  or  .594798 
degrees  from  the  correct  rotational  axis.  Thus,  the  high  error  measure  values  have 
been  used  to  remove  the  rotational-like  displacements  in  the  center  of  the  image. 
The  error  histogram  derived  from  the  flow  field  in  figure  51,  assuming  motion  to  be 
constrained  to  this  plane,  is  shown  in  figure  54a  and  54b.  In  the  contour  plot  (figure 
54b)  a  “  -  ”  indicates  a  local  minimum  and  a  “  +  ”  indicates  a  local  maximum. 
The correct rotation was selected from the histogram (the rotational parameter
was varied from -0.15 to 0.15 radians in 0.1 radian increments). The determined
rotational  field  is  shown  in  figure  55a  and  the  translation  field  which  results  from 
subtracting  the  determined  rotational  field  from  the  original  displacement  field  is 
shown  in  figure  55b. 


Environmental  Inference  via  EDMF  and  Rigidity  Constraints 


A  basic  paradigm  in  computer  vision  is  to  take  an  environmental  property 
and  express  it  in  terms  of  the  constraints  it  imposes  on  resulting  image  structures 
[Barr81]. These constraints are then expressed as equations whose solution determines
an interpretation of image events consistent with the assumed environmental
properties.  In  this  section,  we  utilize  the  constraint  of  environmental  rigidity  to 
derive  a  set  of  equations  whose  solution  determines  a  set  of  environmental  depths 
that  are  consistent  with  given  image  displacements.  We  show  the  conditions  under 
which  solutions  to  these  equations  are  possible  [Lawt80,  Meir80,  Ullm79,  Webb81] 
for general motion and how these conditions are affected for restricted cases of motion.
We then show how the equations for unrestricted motion are significantly
simplified  when  information  concerning  the  direction  of  environmental  motion  is 
also  utilized. 


Development  of  Rigidity  Constraints 

For this development, we refer to the camera model described in chapter III.
Equation 1 from chapter III can be used to transform relations expressed between environmental
points into a set of equations in terms of image position vectors and
unknown Z values which correspond to the environmental depth values. Solutions
to the resulting equations yield a set of Z values which provide a consistent interpretation
over time for the positions of the corresponding set of environmental
points.


The basic relation for interpreting environmental motion is the assumption of
rigidity which reflects the invariance of distance between environmental points during
motion. For two points i and j on a rigid body at times m and n, this
preservation of distance is expressed as

$$\|\vec{P}_{mi} - \vec{P}_{mj}\| = \|\vec{P}_{ni} - \vec{P}_{nj}\| \tag{16}$$


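which can be expanded, by using the substitution specified by equation 1 from
chapter III and squaring both sides, into the image-based equation

$$\begin{aligned}
& z_{mi}^2(\vec{I}_{mi}\cdot\vec{I}_{mi}) + z_{mj}^2(\vec{I}_{mj}\cdot\vec{I}_{mj}) \\
& \quad - 2 z_{mi} z_{mj}(\vec{I}_{mi}\cdot\vec{I}_{mj}) - z_{ni}^2(\vec{I}_{ni}\cdot\vec{I}_{ni}) \\
& \quad - z_{nj}^2(\vec{I}_{nj}\cdot\vec{I}_{nj}) + 2 z_{ni} z_{nj}(\vec{I}_{ni}\cdot\vec{I}_{nj}) = 0
\end{aligned} \tag{17}$$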


where  the  inner-product  terms  in  parentheses  are  constants  determined  from  the 
positions  of  image  points.  To  determine  a  solution,  we  will  find  the  minimum 
number  of  points  and  frames  for  which  the  number  of  independent  constraints  (in 
the  form  of  equation  17)  equals  or  exceeds  the  number  of  unknown  Z  values.  It  is 
then  necessary  to  solve  the  resulting  set  of  simultaneous  equations.  Note  that  each 
such  constraint  is  a  second  degree  polynomial  in  4  unknowns. 
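
A small numerical check of equation 17 may help. The sketch assumes that an environmental point is its depth times its image position vector with unit focal length, P = z (X/Z, Y/Z, 1), and uses a hypothetical rigid motion (a rotation about the Y axis followed by a translation). All names and numbers are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(p):
    """Return (depth z, image position vector I) so that P = z * I."""
    return p[2], (p[0] / p[2], p[1] / p[2], 1.0)

def rigidity_residual(zmi, imi, zmj, imj, zni, ini, znj, inj):
    """Left-hand side of the image-based rigidity constraint (equation 17)."""
    return (zmi ** 2 * dot(imi, imi) + zmj ** 2 * dot(imj, imj)
            - 2.0 * zmi * zmj * dot(imi, imj)
            - zni ** 2 * dot(ini, ini) - znj ** 2 * dot(inj, inj)
            + 2.0 * zni * znj * dot(ini, inj))

def move(p, angle, t):
    """Rotate p about the Y axis by angle, then translate by t (a rigid motion)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] + s * p[2] + t[0], p[1] + t[1], -s * p[0] + c * p[2] + t[2])

p_i, p_j = (1.0, 2.0, 10.0), (-2.0, 0.5, 14.0)            # points at time m
q_i, q_j = (move(p, 0.1, (0.3, -0.2, 1.0)) for p in (p_i, p_j))  # points at time n
residual = rigidity_residual(*project(p_i), *project(p_j),
                             *project(q_i), *project(q_j))
```

With the true depths the residual vanishes up to rounding; with incorrect depths it generally does not, which is what makes the constraint usable for depth inference.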

We begin with the number of unknown Z values. For N points in K frames
(where N > 2 and K > 1), there are (NK - 1) unknown Z values. The decrease
by one in the number of unknowns reflects the loss of absolute scale information.
Thus, one of the Z values can be set to an arbitrary value which can be recovered
from the actual sensor displacement if such absolute measurements are available.

The number of rigidity constraints generated by a set of N points in K frames
is the product of 3 x (N - 2) and (K - 1). The first term is the minimum number
of  unique  distances  which  must  be  specified  between  pairs  of  points,  in  a  body  of 
N  points  with  no  three  points  being  collinear,  to  assure  its  rigidity.  Thus,  4  points 
require  6  pairwise  distances  (all  that  are  possible).  For  configurations  of  more  than 
4  points,  it  is  necessary  to  specify  the  distance  of  each  additional  point  to  only 
3  other  points  to  assure  rigidity.  The  second  term  is  the  number  of  interframe 
intervals,  with  each  interval  providing  a  set  of  additional  constraining  points.  Each 
distance  specified  must  be  maintained  over  each  interframe  interval. 

A solution is possible when the number of constraints is greater than or equal to the
number of unknowns. This occurs when:

$$2NK - 6K - 3N + 7 \geq 0 \tag{18}$$


Thus,  minimal  solutions  can  be  found  when  N  =  5  and  K  =  2 ,  producing  nine 
constraint  equations  or  when  N  =  4  and  K  =  3  producing  12  constraint  equations. 
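
The counting argument can be reproduced directly. The following sketch simply tabulates the two counts given above, 3(N - 2)(K - 1) constraints against NK - 1 unknowns, and searches a small range for the minimal cases; the helper names are hypothetical.

```python
def unknowns(n, k):
    """Unknown Z values for n points in k frames (one is fixed by scale)."""
    return n * k - 1

def constraints(n, k):
    """Rigidity constraints for n points in k frames."""
    return 3 * (n - 2) * (k - 1)

def solvable(n, k):
    return constraints(n, k) >= unknowns(n, k)

# A case is minimal if removing one point or one frame makes it unsolvable.
minimal = [(n, k) for n in range(3, 7) for k in range(2, 5)
           if solvable(n, k)
           and not solvable(n - 1, k) and not solvable(n, k - 1)]
```

Over this range the minimal cases are N = 4 with K = 3 (12 constraints, 11 unknowns) and N = 5 with K = 2 (9 constraints, 9 unknowns), in agreement with the text.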

Rigidity  Constraints  Applied  to  Known  Planar  Motion  As  one  would  expect, 
the  rigidity  constraints  are  simplified  by  adding  restrictions  on  allowable  motions 
of  environmental  points.  For  example,  consider  motion  constrained  to  a  plane. 
For  simplicity,  we  will  assume  that  it  is  parallel  to  the  XZ  plane  of  the  camera 
coordinate  system,  but  an  appropriate  transformation  can  be  applied  so  that  the 
results  are  valid  for  motion  constrained  to  an  arbitrarily  oriented,  but  known,  plane. 
Here, the Y component of an environmental point is assumed to remain constant
over time. For a point i at times m and n, this is expressed as

$$z_{mi}\, y_{mi} = z_{ni}\, y_{ni} \tag{19}$$

and solving for $z_{ni}$ yields

$$z_{ni} = z_{mi}\,\frac{y_{mi}}{y_{ni}} \tag{20}$$

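This allows a substitution for points i and j in equation 17 which simplifies (at
least in terms of the number of unknowns) the rigidity constraint to

$$\begin{aligned}
& z_{mi}^2\left((\vec{I}_{mi}\cdot\vec{I}_{mi}) - \left(\frac{y_{mi}}{y_{ni}}\right)^{2}(\vec{I}_{ni}\cdot\vec{I}_{ni})\right)
+ z_{mj}^2\left((\vec{I}_{mj}\cdot\vec{I}_{mj}) - \left(\frac{y_{mj}}{y_{nj}}\right)^{2}(\vec{I}_{nj}\cdot\vec{I}_{nj})\right) \\
& \quad + 2 z_{mi} z_{mj}\left(\left(\frac{y_{mi}}{y_{ni}}\right)\left(\frac{y_{mj}}{y_{nj}}\right)(\vec{I}_{ni}\cdot\vec{I}_{nj}) - (\vec{I}_{mi}\cdot\vec{I}_{mj})\right) = 0
\end{aligned} \tag{21}$$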

The  planarity  constraint  has  removed  two  unknowns.  Note  that  the  bracketed 
expressions  are  again  constants  that  can  be  determined  from  the  locations  of  the 
image points. This equation can be solved given two points in two frames. Thus, for
points i and j at times m and n with the corresponding unknown depth values
$z_{mi}$, $z_{mj}$, $z_{ni}$, $z_{nj}$, equation 21 reduces these to a system of 2 unknowns, $z_{mi}$ and
$z_{mj}$. One of these variables, say $z_{mi}$, can be set to an arbitrary value, reflecting
scale independence, allowing $z_{mj}$ to then be determined by solving the quadratic
in terms of $z_{mi}$.
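
The planar case can be sketched as follows, under stated assumptions: unit focal length, motion parallel to the XZ plane, image y coordinates that are nonzero, and a nonzero leading coefficient in the quadratic. The first depth is fixed to 1 and the quadratic of equation 21 is solved for the second. Function names and the synthetic motion are hypothetical. Being a quadratic, the equation generally has two roots, so some further information would be needed to select between them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def image_vector(p):
    """Image position vector I = (X/Z, Y/Z, 1), assuming unit focal length."""
    return (p[0] / p[2], p[1] / p[2], 1.0)

def planar_depth_candidates(imi, imj, ini, inj):
    """Roots z_mj of equation 21 with z_mi fixed to 1."""
    ri = imi[1] / ini[1]          # z_ni / z_mi, from equation 20
    rj = imj[1] / inj[1]          # z_nj / z_mj, from equation 20
    a_i = dot(imi, imi) - ri * ri * dot(ini, ini)
    a_j = dot(imj, imj) - rj * rj * dot(inj, inj)
    b = dot(imi, imj) - ri * rj * dot(ini, inj)
    # Equation 21 with z_mi = 1 reads: a_j * z^2 - 2 * b * z + a_i = 0.
    disc = b * b - a_i * a_j
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return [(b - root) / a_j, (b + root) / a_j]

def move_in_xz(p, angle, tx, tz):
    """Rotate about the Y axis and translate within the XZ plane (Y unchanged)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] + s * p[2] + tx, p[1], -s * p[0] + c * p[2] + tz)

p_i, p_j = (1.0, 2.0, 10.0), (-2.0, 0.5, 14.0)
q_i, q_j = move_in_xz(p_i, 0.1, 0.3, 1.0), move_in_xz(p_j, 0.1, 0.3, 1.0)
candidates = planar_depth_candidates(image_vector(p_i), image_vector(p_j),
                                     image_vector(q_i), image_vector(q_j))
```

In this synthetic case the true depth ratio 14/10 = 1.4 appears among the candidates.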

Rigidity Constraints Applied to Translational Motion The constraint imposed
by translational motion of points i and j on a rigid body at times m and n is
expressed by

$$\vec{P}_{mi} - \vec{P}_{mj} = \vec{P}_{ni} - \vec{P}_{nj} \tag{22}$$


which  is  similar  to  equation  16  except  the  operation  is  vector  subtraction  reflecting 
the preservation of length and orientation under translation. Setting $z_{mi}$ to the
constant value 1, to reflect scale independence in equation 22, yields 3 simultaneous
linear equations in 3 unknowns

$$(x_{mi}, y_{mi}, 1) - z_{mj}(x_{mj}, y_{mj}, 1) - z_{ni}(x_{ni}, y_{ni}, 1) + z_{nj}(x_{nj}, y_{nj}, 1) = 0 \tag{23}$$


Thus,  not  surprisingly,  environmental  inference  from  translation  requires  2  points 
in  2  frames. 
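
A minimal sketch of the translational case, assuming unit focal length and image position vectors of the form (x, y, 1): with the first depth fixed to 1, the vector relation of equation 22 becomes a 3 x 3 linear system in the remaining three depths, solved here by Cramer's rule. The function names and the synthetic points are hypothetical, and the three image vectors involved are assumed to be linearly independent.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - b[0] * (a[1] * c[2] - a[2] * c[1])
            + c[0] * (a[1] * b[2] - a[2] * b[1]))

def image_vector(p):
    """Image position vector I = (X/Z, Y/Z, 1), assuming unit focal length."""
    return (p[0] / p[2], p[1] / p[2], 1.0)

def translation_depths(imi, imj, ini, inj):
    """Solve I_mi - z_mj I_mj - z_ni I_ni + z_nj I_nj = 0 for (z_mj, z_ni, z_nj)."""
    c1 = tuple(-v for v in imj)
    c2 = tuple(-v for v in ini)
    c3 = tuple(inj)
    rhs = tuple(-v for v in imi)
    d = det3(c1, c2, c3)
    return (det3(rhs, c2, c3) / d, det3(c1, rhs, c3) / d, det3(c1, c2, rhs) / d)

t = (0.3, -0.2, 1.0)                                   # pure translation
p_i, p_j = (1.0, 2.0, 10.0), (-2.0, 0.5, 14.0)         # points at time m
q_i, q_j = (tuple(a + b for a, b in zip(p, t)) for p in (p_i, p_j))
z_mj, z_ni, z_nj = translation_depths(
    *(image_vector(p) for p in (p_i, p_j, q_i, q_j)))
```

The recovered values are the true depths divided by the true depth of point i at time m, here 1.4, 1.1 and 1.5.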


Solving  the  Rigidity  Constraints  using  the  EDMF 

The  rigidity  constraints  can  be  significantly  simplified  when  they  are  integrated 
with  information  concerning  the  environmental  direction  of  motion  from  the  local 
translational  decomposition.  To  do  this  the  EDMF  is  used  first  to  find  consistent 
relative  depths  for  single  points  over  successive  images.  Consistent  relative  depths 
for  several  points  are  then  determined  by  scaling  the  particular  depth  values  for  the 
individual  points  using  the  rigidity  constraint. 


Figure  56.  Relative  Depths  for  a  point  over  time  from  the  EDMF. 

We first examine the use of the EDMF in the determination of consistent relative
depths for a single point over time. Consider the image position vectors $\vec{I}_{mi}$ and
$\vec{I}_{ni}$ (for the successive image positions of point i at times m and n) and the
environmental direction of motion associated with point i at time m, $\vec{E}_{mi}$ (Figure
56). Assuming the ideal case, in which there is no error in any of these quantities,
the EDMF vector $\vec{E}_{mi}$ will lie in the plane determined by $\vec{I}_{mi}$ and $\vec{I}_{ni}$. Thus, given
a depth $z_{mi}$ along the ray of projection corresponding to $\vec{I}_{mi}$, one can find a depth
value $z_{ni}$ along the ray of projection associated with $\vec{I}_{ni}$ from the intersection of
the lines $\vec{P}_{mi} + t\vec{E}_{mi}$ and $z_{ni}\vec{I}_{ni}$. In the usual case of error in these measurements,
these lines will not intersect because they are skew in three dimensions. In these
instances we can solve for the line segment which is perpendicular to both of these
lines. Let us express the point along the ray of projection determined by $\vec{I}_{ni}$ which
is closest to the line determined by the point $\vec{P}_{mi} = z_{mi}\vec{I}_{mi}$ and the direction of
motion $\vec{E}_{mi}$ from the EDMF:


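$$\begin{aligned}
\left((z_{mi}\vec{I}_{mi} + t\vec{E}_{mi}) - (z_{ni}\vec{I}_{ni})\right)\cdot\vec{E}_{mi} &= 0 \\
\left((z_{mi}\vec{I}_{mi} + t\vec{E}_{mi}) - (z_{ni}\vec{I}_{ni})\right)\cdot\vec{I}_{ni} &= 0
\end{aligned}$$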


which  simplifies  to 


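$$\begin{aligned}
t(\vec{E}_{mi}\cdot\vec{E}_{mi}) - z_{ni}(\vec{I}_{ni}\cdot\vec{E}_{mi}) &= -z_{mi}(\vec{I}_{mi}\cdot\vec{E}_{mi}) \\
t(\vec{E}_{mi}\cdot\vec{I}_{ni}) - z_{ni}(\vec{I}_{ni}\cdot\vec{I}_{ni}) &= -z_{mi}(\vec{I}_{mi}\cdot\vec{I}_{ni})
\end{aligned}$$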


These equations can be expressed in terms of the ratio of the relative distances
along the successive rays of projection consistent with the environmental direction
of motion $\vec{E}_{mi}$ (and treating t as a dummy variable)

$$\begin{aligned}
t(\vec{E}_{mi}\cdot\vec{E}_{mi}) - r_{mni}(\vec{I}_{ni}\cdot\vec{E}_{mi}) &= -(\vec{I}_{mi}\cdot\vec{E}_{mi}) \\
t(\vec{E}_{mi}\cdot\vec{I}_{ni}) - r_{mni}(\vec{I}_{ni}\cdot\vec{I}_{ni}) &= -(\vec{I}_{mi}\cdot\vec{I}_{ni})
\end{aligned} \tag{26}$$

where

$$r_{mni} = \frac{z_{ni}}{z_{mi}}$$

This yields the relative depths of a single point over time. We now use the
rigidity constraint to determine the appropriate scaling of each of these ratios for
all of the points.
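
The depth ratio of equation 26 follows from a 2 x 2 linear solve. The sketch below eliminates the dummy variable t in closed form; it assumes unit focal length, and the function name `depth_ratio` and the synthetic point are hypothetical. The denominator vanishes only when the direction of motion is parallel to the second ray of projection, a case the sketch does not handle.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def image_vector(p):
    """Image position vector I = (X/Z, Y/Z, 1), assuming unit focal length."""
    return (p[0] / p[2], p[1] / p[2], 1.0)

def depth_ratio(imi, ini, emi):
    """Solve the two equations of (26) for r = z_ni / z_mi, eliminating t."""
    ee, ei, ii = dot(emi, emi), dot(emi, ini), dot(ini, ini)
    me, mi = dot(imi, emi), dot(imi, ini)
    denom = ee * ii - ei * ei     # zero only if E is parallel to I_ni
    return (ee * mi - ei * me) / denom

p_m, d = (1.0, 2.0, 10.0), (0.3, -0.2, 1.0)    # point and its displacement
p_n = tuple(a + b for a, b in zip(p_m, d))
norm = math.sqrt(dot(d, d))
e = tuple(c / norm for c in d)                 # environmental direction of motion
r = depth_ratio(image_vector(p_m), image_vector(p_n), e)
```

For this error-free example the ratio equals the true value 11/10. With noisy inputs the same expression gives the least-distance solution described above, since it comes from the perpendicularity conditions rather than from an exact intersection.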

Assume we have two points i and j at times m and n. Let $z_{mj}$ be set to an
arbitrary value. Then, $z_{nj}$ may be obtained by the product $z_{mj} \times r_{mnj} = z_{mj} \times (z_{nj}/z_{mj})$,
where the ratio $r_{mnj}$ is obtained through the relation expressed in equation 26.
This yields the environmental points $\vec{P}_{mj}$ and $\vec{P}_{nj}$. We can now use the rigidity
constraint to determine a scale factor expressing $\vec{P}_{mi} = z_{mi}\vec{I}_{mi}$ and $\vec{P}_{ni} = z_{ni}\vec{I}_{ni} =
z_{mi} r_{mni}\vec{I}_{ni}$ in terms of $\vec{P}_{mj}$ and $\vec{P}_{nj}$. Substitution into the rigidity constraint yields

$$\|z_{mi}\vec{I}_{mi} - \vec{P}_{mj}\| = \|z_{mi} r_{mni}\vec{I}_{ni} - \vec{P}_{nj}\| \tag{27}$$


where $z_{mi}$ is the scale factor. Equation 27 can be expanded as

$$\begin{aligned}
& z_{mi}^2\left((\vec{I}_{mi}\cdot\vec{I}_{mi}) - (r_{mni}\vec{I}_{ni}\cdot r_{mni}\vec{I}_{ni})\right) \\
& \quad + \left((\vec{P}_{mj}\cdot\vec{P}_{mj}) - (\vec{P}_{nj}\cdot\vec{P}_{nj})\right) \\
& \quad - 2 z_{mi}\left((\vec{I}_{mi}\cdot\vec{P}_{mj}) - (r_{mni}\vec{I}_{ni}\cdot\vec{P}_{nj})\right) = 0
\end{aligned} \tag{28}$$

The resulting equation is quadratic in one unknown. Thus, given successive depth
values determined for a particular point from its EDMF vector, consistent depths
can be determined for every other pair of successive depth values by solving this
equation for each resulting pair of points.

In summary, given a flow field and an EDMF, a pair of depth values for each
image point at successive instants m and n can be found which are consistent
with the determined EDMF vectors describing motion from time m to n. These
are relative depth values inferred from equation 26, and hence may be scaled arbitrarily.
Once these relative, successive depth values are determined for each
point, they may then be scaled relative to a selected point whose depth is arbitrarily
set by solving equation 28 for each point paired with this selected point. There is
a great deal of redundancy for optimization procedures to exploit. Several depth
maps can be computed (one for each selected image point) and the certainty of a
particular depth inference would be based upon agreement in the relative depth
values in all the resulting depth maps. If there are further spatial constraints, such
as motion relative to a planar surface, all the determined depth maps would have to
be in agreement with respect to the shape. For example, all the determined depth
maps for a plane would have to correspond to a single plane at the same orientation.

This work shows that if the EDMF can be reliably computed, it is a very useful
low level representation for rigid body motion analysis. This is possible for densely
textured image sequences for which the camera motion parameters to be recovered
correspond  to  motion  constrained  to  an  unknown  plane.  The  local  translational 
decomposition  may  also  be  applicable  to  inferring  qualitative  descriptions  of  non- 
rigid  motions  by  noting  certain  patterns  in  the  relative  directions  of  motion  as  would 
typify  such  motions  as  expanding  or  twisting. 
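
The scaling step of equations 27 and 28 can be sketched as a quadratic solve for the depth of point i at time m, given its depth ratio and the two positions of the reference point j. The sketch assumes unit focal length and a nonzero leading coefficient; the function names and the synthetic rigid motion are hypothetical, and the true depth ratio stands in for the value that equation 26 would supply. The quadratic generally has two roots, so agreement across several reference points, as suggested in the summary above, would be one way to choose between them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def image_vector(p):
    """Image position vector I = (X/Z, Y/Z, 1), assuming unit focal length."""
    return (p[0] / p[2], p[1] / p[2], 1.0)

def scale_candidates(imi, ini, r_mni, p_mj, p_nj):
    """Roots z_mi of equation 28: a*z^2 - 2*b*z + c = 0."""
    rin = tuple(r_mni * v for v in ini)
    a = dot(imi, imi) - dot(rin, rin)
    b = dot(imi, p_mj) - dot(rin, p_nj)
    c = dot(p_mj, p_mj) - dot(p_nj, p_nj)
    disc = b * b - a * c
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return [(b - root) / a, (b + root) / a]

def move(p, angle, t):
    """Rotate p about the Y axis by angle, then translate by t (a rigid motion)."""
    cs, sn = math.cos(angle), math.sin(angle)
    return (cs * p[0] + sn * p[2] + t[0], p[1] + t[1], -sn * p[0] + cs * p[2] + t[2])

p_i, p_j = (1.0, 2.0, 10.0), (-2.0, 0.5, 14.0)
q_i, q_j = (move(p, 0.1, (0.3, 0.0, 1.0)) for p in (p_i, p_j))
r_mni = q_i[2] / p_i[2]     # stand-in for the ratio obtained from equation 26
scales = scale_candidates(image_vector(p_i), image_vector(q_i), r_mni, p_j, q_j)
```

Here the true depth of point i at time m, 10, is one of the two candidate scale factors.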


SUMMARY  AND  FUTURE  WORK 


We  summarize  the  major  contributions  of  this  thesis  and  many  of  the  questions 
it  raises  for  further  study.  We  shall  conclude  with  a  consideration  of  two  major  areas 
for  future  research  that  are  intimately  related  to  motion  processing:  architectures 
for  real-time  processing  and  image  interpretation  in  the  domain  of  dynamic  road 
scenes. 


Summary 

The  review  of  work  in  dynamic  image  processing  in  chapter  II  stressed  a  basic 
problem  in  motion  research.  There  has  been  a  discrepancy  between  the  precision 
and reliability with which image displacements can be determined and the sensitivity
of the environmental and sensor motion inference procedures to such noise
and  resolution  errors.  In  addition,  there  are  open  questions  about  the  stability  of 
the  inference  procedures  themselves.  We  noted  that  this  has  limited  the  practical 
applications  of  dynamic  image  processing  in  domains  where  its  use  is  fundamental. 

In chapter IV we developed a procedure for processing translational motion.
The most important feature of this procedure is that the determination of the image
displacements, the direction of sensor motion, and environmental depth are
combined into a single, mutually constraining computation. The procedure consists
of two basic steps: Feature Extraction and Search. The feature extraction process
finds small image areas which may correspond to distinguishing, and therefore
trackable, parts of environmental objects. The direction of translational motion
is then found by a search which minimizes an error measure defined over a unit
sphere, with each point on the sphere corresponding to a different direction of sensor
translation. A given direction of translation constrains the motion of extracted
image features to straight lines which radiate from or converge onto a single point
in the image plane. Thus, the error measure associates a point on the unit sphere,
corresponding to a particular translational axis, with a number describing the degree
of total feature mismatch along the set of displacement paths determined by
the translational axis. Experience has shown this error measure to be smooth,
with a distinct minimum, in a large neighborhood about the correct translational
axis. This allows simple search procedures to be effective. Experiments were presented
which indicated that the algorithm was robust in a variety of ways. It could
function effectively with weak or false features, with a small number of features,
and even with a small number of features in limited portions of an image.

Many  extensions  and  possible  areas  of  further  work  were  also  discussed,  and 
we  mention  two,  here,  that  are  of  particular  interest.  First,  the  procedure  should 
be  developed  to  extend  over  multiple  frames.  The  determined  translational  axis, 
image  displacements,  and  environmental  depth  values  should  be  used  to  constrain 
further  processing  and  feature  extraction  in  a  manner  that  will  allow  refinement 
in  the  accuracy  of  sensor  motion  parameters  and  the  environmental  depth  map. 
Second,  a  theoretical  formulation  is  necessary  to  develop  a  more  complete,  analytical 
understanding  of  the  robustness  of  the  procedure. 

In chapter V we considered other extensions to the translational procedure including
its embodiment as a hierarchical computation; processing translational blur
paths; dealing with multiple independently translating objects; and using the translational
procedure for autonomous vehicle control by having a stabilized sensor or
associated devices to determine the rotational parameters. The hierarchical extension
was found to significantly increase the speed of the procedure, since it reduces
the number of feature correlations that are necessary along potential translational
displacement paths. There are still a variety of alternatives to be investigated
before the most effective implementation of the hierarchical computation will be
thoroughly understood. We showed that the processing of translational blur paths
could be performed by a simple extension of the error measure used in chapter IV.
The extensions discussed for multiple, independently moving objects were based
upon the similarity of the translational procedure to generalized Hough transforms
and the limited image areas necessary for the procedure to function. Finally, the
incorporation of the procedure with sensor stabilization and rotational displacement
sensing devices has exciting implications for passive-sensing based autonomous vehicles.

In chapter VI we successfully processed other simple cases of restricted motion,
pure sensor rotation and motion constrained to a known plane, for which it was
computationally feasible to search through the subspace of the sensor motion parameters
for values that are consistent with image feature displacements. For pure
sensor rotation the dimensionality of the search increased over the translational
case, but was compensated for by the additional constraint that the extents of all
feature displacements were identical. We noted a typical case of planar motion,
quite common to terrestrial motion, which is inherently ambiguous.

In chapter VII we showed how to process sensor motion by applying the procedure
for translational motion to local areas of images. This yields a low level
description of motion that we termed the Environmental Direction of Motion Field
(EDMF) which associated a relative direction of environmental motion between
features from restricted image subareas and the sensor. We showed how to process
the case of motion constrained to an unknown plane using the fact that
all the EDMF vectors are constrained to lie in this plane. This constraint forms
the basis of a robust computation to recover the parameters of sensor motion in
this case. We discussed the recovery of the parameters of sensor motion from the
EDMF for general sensor motion. We developed the rigidity constraints which express
the inference of environmental depth from displacement fields by exploiting
the preservation of object rigidity during motion. We showed that these constraints
were directly solvable for restricted cases of motion and that this was also possible
for arbitrary motion when information from the EDMF was incorporated with the
rigidity constraints.

There are several aspects of the work in chapter VII that require further exploration.
The processing of unrestricted motion should be evaluated with respect to
the required accuracy of the set of direction vectors in the EDMF. It may be possible
to derive qualitative inferences more robustly. This is also related to the way in
which the EDMF is computed. We investigated only one of the techniques that were
discussed, the case where image subareas are centered on individual features. In
another of the suggested techniques, the subareas are formed by dividing the image
into regular, nonoverlapping subareas and applying the translational procedure over
each of these. In this case, the EDMF would not be associated with a particular
environmental point, but with a larger environmental area, thereby reducing the
resolution in the EDMF.


Direct solutions to the rigidity constraints should also be studied further, since
our formulation of the rigidity constraints was developed some years ago [Lawt80]
and was not explored beyond noting that the equations were tractable using simple
iterative optimization techniques and that the solutions were multimodal in the
cases of minimal numbers of points and image frames. What, for example, are the
effects of using multiple images and a greater number of points? Additionally, there
has been interest in using optimization procedures based on simulated annealing
[Kirk83] to solve these equations. These techniques have shown an ability to deal
with multimodal error surfaces in very high dimensional spaces.


Future  Work 

Architectures  for  Translational  Motion  Processing 

The translational procedure that we have developed offers an attractive possibility
for real-time implementation of a motion processing system. The architecture is a
straightforward design consisting of multiple independent processors, each associated
with a unique, disjoint set of features. Each processor determines the displacement
and extent of error for its features along the translational displacement paths
specified by a given FOE/C. The processors are coordinated by a global search
executive, which specifies a particular FOE/C, sums the error responses of the
processors, and determines which translational axis is to be evaluated next. The
critical parameters for effective implementation are the speed with which a processor
can determine a feature's displacement along its displacement path and the number of
times the error function must be evaluated to determine the translational axis to
sufficient accuracy. Experiments with the translational procedure indicate that,
outside of pathological cases, fewer than 50 evaluations of the error function will be
sufficient, and even fewer when the translational axis has been initialized by previous
processing. Preliminary timing studies using Motorola 68000 processors (10 megahertz
minor cycle time) to determine feature displacements indicate that the necessary
processing rates are feasible.
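The division of labor between the processors and the search executive can be sketched as follows. This is a simplified stand-in, not the procedure developed in the thesis. Each feature here already carries a measured displacement vector, the per-feature error is the squared distance from a candidate FOE to the line through that displacement, and the executive uses a plain compass search in the image plane. The names `Processor`, `search_executive`, and `synthetic_features`, the error measure, and the search strategy are all assumptions made for illustration.

```python
import math

def feature_error(feature, foe):
    """Squared distance from a candidate FOE to the line through a feature's displacement."""
    (x, y), (dx, dy) = feature
    length = math.hypot(dx, dy)  # assumed nonzero for every feature
    # Cross product of (foe - position) with the unit displacement direction.
    return (((foe[0] - x) * dy - (foe[1] - y) * dx) / length) ** 2

class Processor:
    """Stands in for one independent processor owning a disjoint feature set."""
    def __init__(self, features):
        self.features = features

    def error(self, foe):
        return sum(feature_error(f, foe) for f in self.features)

def search_executive(processors, start, step=64.0, min_step=0.5):
    """Compass search: propose a FOE, sum the processors' errors, choose the next FOE."""
    evaluations = 0

    def total_error(foe):
        nonlocal evaluations
        evaluations += 1
        return sum(p.error(foe) for p in processors)

    best, best_err = start, total_error(start)
    while step >= min_step:
        improved = False
        for ox, oy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            candidate = (best[0] + ox, best[1] + oy)
            err = total_error(candidate)
            if err < best_err:
                best, best_err, improved = candidate, err, True
        if not improved:
            step /= 2.0  # refine the search once no neighbor improves
    return best, best_err, evaluations

def synthetic_features(true_foe, scale=0.1):
    """Noise-free displacements radiating from true_foe on a 5 x 5 grid of positions."""
    coords = (-100.0, -50.0, 0.0, 50.0, 100.0)
    return [((x, y), (scale * (x - true_foe[0]), scale * (y - true_foe[1])))
            for x in coords for y in coords]

features = synthetic_features((40.0, -25.0))
processors = [Processor(features[i::4]) for i in range(4)]  # disjoint subsets
foe, err, evaluations = search_executive(processors, start=(0.0, 0.0))
print(foe, err, evaluations)
```

Because each `Processor.error` call depends only on its own features and the candidate FOE, the calls inside `total_error` could run concurrently. The executive needs only the summed response, which is the property that makes the design attractive for parallel hardware. The evaluation count returned by this toy search is specific to the compass search used here and should not be compared with the figure reported for the thesis procedure.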


Research often advances through the stimulating problems that are found in a wisely
chosen task domain. The VISIONS system [Hans78] used outdoor house scenes as a guiding
incentive to develop procedures and representations necessary for complex imagery. A
domain that we feel would be challenging, yet one in which results would be achievable,
is the interpretation of outdoor road scenes along highways and country roads as seen
from a moving vehicle. This domain is quite tractable under assumptions consistent
with a variety of the algorithms presented in this thesis. The assumed constraints
might include the vehicle constrained to translational motion or constrained to a
plane; a stabilized sensor or knowledge of the rotational parameters; sensor and
object motions constrained to slowly changing translations; or motion of independently
moving objects constrained to a roughly determined plane. This domain forces us to
address interesting questions such as how to achieve dynamic segmentations using the
temporal behavior of complex image structures, how to incorporate object-specific
semantics into recognition using environmental depth and image motion information, and
how to perform predictive processing from a model which is established by temporally
extended inferences. Thus, a whole new set of issues arises as a full road scene
interpretation system is developed which integrates motion and static interpretation
into a goal-oriented perceptual system in a dynamic environment.